Preface


This volume is a thoroughly revised second edition of Evolutionary Genomics: Statistical and 
Computational Methods published in 2012. Like the first edition, the new volume includes 
comprehensive reviews of the most recent and fundamental developments in bioinformatics 
methods for evolutionary genomics and related challenges associated with increasing data 
size, heterogeneity, and inherent complexity.

Throughout the volume, prominent authors address the challenge of analyzing and 
understanding the dynamics of complex biological systems, and elaborate on some 
promising strategies that would bring us closer to the ultimate “holy grail” of biology:
uncovering the relationships between genotype and phenotype. Consequently, the
presented collection of peer-reviewed articles also represents a synergy between theoretical and
experimental scientists from a range of disciplines, working together towards a common 
goal. Once again, the revised volume reiterates the power of taking an evolutionary 
approach to study molecular data. 

This book is intended for scientists looking for a compact overview of the cutting-edge 
statistical and computational methods in evolutionary genomics. The volume may serve as a 
comprehensive guide for both graduate and advanced undergraduate students planning to 
specialize in genomics and bioinformatics. Equally, the volume should be helpful for 
experienced researchers entering genomics from more fundamental disciplines, such as 
statistics, computer science, physics, and biology. In other words, the material presented 
here should suit both a novice in biology with strong statistics and computational skills and a 
molecular biologist with a good grasp of standard mathematical concepts. To cater to 
differences in reader backgrounds, Part I is composed of educational primers to help with
fundamental concepts in genome biology (Chapter 1), probability and statistics (Chapter 2), 
and molecular evolution (Chapter 3). As these concepts reappear repeatedly throughout the 
book, the first three chapters will help the neophyte to stay “afloat”. The exercises and 
questions offered at the end of each chapter serve to deepen the understanding of the 
material. 

Part II of this volume focuses on sequence homology and alignment, from aligning
whole genomes (Chapter 4) to disentangling orthologs, paralogs, and transposable
elements (Chapters 5 and 6). Part III includes chapters on phylogenetic methods to study
genome evolution. Chapter 7 presents multispecies coalescent methods for reconciling
phylogenetic discord between gene and species trees. However, a mathematically convenient
“binary tree” model does not always stand up to scrutiny, as numerous evolutionary processes
act in a reticulate (network-like) fashion, complicating the statistical description of
evolutionary models and increasing computational complexity, often to prohibitive levels. One
simplification is to assume that some molecular sequence units (genes, gene segments)
still evolve in a treelike manner. Under this assumption, Chapter 8 describes one practical
approach to meaningfully summarize the binary tree distributions for a set of genomes as a
“forest of trees”. Alternatively, network-like phylogenetic relationships can be represented
by graphs (Chapter 9). Dating methods for genome-scale data are discussed in Chapter 10, while
Chapter 11 provides more examples of non-treelike processes in a comparative review of
genome evolution in different breeding systems.

By disentangling different evolutionary forces acting on genomes, we hope to understand
the origins of biological innovation, which is often thought to be coupled with natural
selection. After all, how do we explain that, in the words of Darwin, “from so simple a
beginning endless forms most beautiful and most wonderful have been, and are being,
evolved”? This is the main topic of Part IV, which discusses the methodology for evaluating
selective pressures on genomic sequences (Chapters 12–14) and genomic evolution in light
of protein domain architecture and transposable elements (Chapters 15 and 16). Part V of
this book is dedicated to population genomics and other omics, with example applications to
disease. Indeed, as evolution starts in populations, there is much interest in generating and
studying population genome data for a wide range of species. Chapter 17 discusses models
for genetic architectures of complex disease and genome-wide association studies for finding
susceptibility variants. Chapter 18 reviews approaches to study ancestral population
genomics. Chapters 19, 20, and 21 illustrate first principles of analyzing environmental sequences
and applications to clinical trials and systems genetics. Finally, Part VI concludes the book
by discussing current bottlenecks in handling and analyzing genomic data. Chapter 22
focuses on challenges and approaches for large and complex data representation and
simultaneous querying of heterogeneous databases. Chapter 23 makes the case for using efficient
high-performance computing strategies for computationally demanding phylogenetic
analyses, in particular in the Bayesian framework. Solutions for scalable workflows and sharing
programming resources are presented in Chapters 24 and 25.

On behalf of all authors, I hope that this book will become a source of inspiration and 
new ideas for our readers. I wish you pleasant reading!


Maria Anisimova
Wädenswil, Switzerland
Lausanne, Switzerland


Acknowledgements 


This renewed edition of Evolutionary Genomics: Statistical and Computational Methods is a 
result of a dedicated effort by 94 co-authors of the book representing research institutions 
from nearly two dozen different countries. Special thanks go to almost 50 independent 
reviewers whose constructive and detailed comments have greatly contributed to improving 
the overall quality of the book chapters and the clarity of the presentation. As for the first 
edition of this book, the cover image was made by the author of Chapter 6 and a talented 
photography artist, Wojciech Makałowski, from the University of Münster, Germany.

By mutual agreement between all authors of the book, all chapters are available Open
Access. The Swiss Institute of Bioinformatics (SIB) and Zurich University of Applied Sciences
(ZHAW) have generously contributed to cover a part of the Open Access publication fees.
Finally, I would like to thank my colleagues at the Institute of Applied Simulations and the 
School of Life Sciences and Facility Management of ZHAW (Zurich University of Applied 
Sciences) as well as my family for their support and encouragement. 




Contents 


Preface
Acknowledgements
Contributors


Part I INTRODUCTION: BIOINFORMATICIAN’S PRIMERS


1 Introduction to Genome Biology and Diversity
Noor Youssef, Aidan Budd, and Joseph P. Bielawski

2 Probability, Statistics, and Computational Science
Niko Beerenwinkel and Juliane Siebourg

3 A Not-So-Long Introduction to Computational Molecular Evolution
Stéphane Aris-Brosou and Nicolas Rodrigue

Part II GENOMIC ALIGNMENT AND HOMOLOGY INFERENCE


4 Whole-Genome Alignment
Colin N. Dewey

5 Inferring Orthology and Paralogy
Adrian M. Altenhoff, Natasha M. Glover, and Christophe Dessimoz

6 Transposable Elements: Classification, Identification, and Their
Use As a Tool For Comparative Genomics
Wojciech Makałowski, Valer Gotea, Amit Pande,
and Izabela Makałowska


Part III PHYLOGENOMICS AND GENOME EVOLUTION


7 Modern Phylogenomics: Building Phylogenetic Trees
Using the Multispecies Coalescent Model
Liang Liu, Christian Anderson, Dennis Pearl,
and Scott V. Edwards

8 Genome-Wide Comparative Analysis of Phylogenetic Trees:
The Prokaryote Forest of Life
Pere Puigbò, Yuri I. Wolf, and Eugene V. Koonin

9 The Methodology Behind Network Thinking: Graphs
to Analyze Microbial Complexity and Evolution
Andrew K. Watson, Romain Lannes, Jananan S. Pathmanathan,
Raphaël Méheust, Slim Karkar, Philippe Colson, Eduardo Corel,
Philippe Lopez, and Eric Bapteste

10 Bayesian Molecular Clock Dating Using Genome-Scale Datasets
Mario dos Reis and Ziheng Yang

11 Genome Evolution in Outcrossing vs. Selfing vs. Asexual Species
Sylvain Glémin, Clémentine M. François, and Nicolas Galtier




Part IV NATURAL SELECTION AND INNOVATION IN GENOMIC SEQUENCES


12 Selection Acting on Genomes
Carolin Kosiol and Maria Anisimova

13 Looking for Darwin in Genomic Sequences: Validity and Success
Depends on the Relationship Between Model and Data
Christopher T. Jones, Edward Susko, and Joseph P. Bielawski

14 Evolution of Viral Genomes: Interplay Between Selection,
Recombination, and Other Forces
Stephanie J. Spielman, Steven Weaver, Stephen D. Shank,
Brittany Rife Magalis, Michael Li, and Sergei L. Kosakovsky Pond

15 Evolution of Protein Domain Architectures
Sofia K. Forslund, Mateusz Kaduk, and Erik L. L. Sonnhammer

16 New Insights on the Evolution of Genome Content: Population
Dynamics of Transposable Elements in Flies and Humans
Lain Guio and Josefa González


Part V POPULATION GENOMICS AND OMICS IN LIGHT OF DISEASE
AND EVOLUTION


17 Association Mapping and Disease: Evolutionary Perspectives
Søren Besenbacher, Thomas Mailund, Bjarni J. Vilhjálmsson,
and Mikkel H. Schierup

18 Ancestral Population Genomics
Julien Y. Dutheil and Asger Hobolth

19 Introduction to the Analysis of Environmental Sequences:
Metagenomics with MEGAN
Caner Bağcı, Sina Beier, Anna Górska, and Daniel H. Huson

20 Multiple Data Analyses and Statistical Approaches
for Analyzing Data from Metagenomic Studies and Clinical Trials
Suparna Mitra

21 Systems Genetics for Evolutionary Studies
Pjotr Prins, Geert Smant, Danny Arends, Megan K. Mulligan,
Rob W. Williams, and Ritsert C. Jansen


Part VI HANDLING GENOMIC DATA: RESOURCES AND COMPUTATION


22 Semantic Integration and Enrichment of Heterogeneous
Biological Databases
Ana Claudia Sima, Kurt Stockinger, Tarcisio Mendes de Farias,
and Manuel Gil

23 High-Performance Computing in Bayesian Phylogenetics
and Phylodynamics Using BEAGLE
Guy Baele, Daniel L. Ayres, Andrew Rambaut,
Marc A. Suchard, and Philippe Lemey

24 Scalable Workflows and Reproducible Data Analysis for Genomics
Francesco Strozzi, Roel Janssen, Ricardo Wurmus, Michael R. Crusoe,
George Githinji, Paolo Di Tommaso, Dominique Belhachemi,
Steffen Möller, Geert Smant, Joep de Ligt, and Pjotr Prins

25 Sharing Programming Resources Between Bio* Projects
Raoul J. P. Bonnal, Andrew Yates, Naohisa Goto, Laurent Gautier,
Scooter Willis, Christopher Fields, Toshiaki Katayama, and Pjotr Prins


Contributors 


ADRIAN M. ALTENHOFF • Computer Science Department, ETH Zurich, Zurich, Switzerland;
Swiss Institute of Bioinformatics, Lausanne, Switzerland

CHRISTIAN ANDERSON • Advantage Testing of Boston, Newton Centre, MA, USA

MARIA ANISIMOVA • Institute of Applied Simulation, School of Life Sciences and Facility
Management, Zurich University of Applied Sciences (ZHAW), Wädenswil, Switzerland;
Swiss Institute of Bioinformatics, Lausanne, Switzerland

DANNY ARENDS • Animal Breeding Biology and Molecular Genetics, Albrecht Daniel
Thaer-Institute for Agricultural and Horticultural Sciences, Humboldt University zu Berlin,
Berlin, Germany

STÉPHANE ARIS-BROSOU • Department of Biology, University of Ottawa, Ottawa, ON,
Canada; Department of Mathematics and Statistics, University of Ottawa, Ottawa, ON,
Canada

DANIEL L. AYRES • Center for Bioinformatics and Computational Biology, University of
Maryland, College Park, MD, USA

GUY BAELE • Department of Microbiology and Immunology, Rega Institute, KU Leuven,
Leuven, Belgium

CANER BAĞCI • Algorithms in Bioinformatics, Faculty of Computer Science, University of
Tübingen, Tübingen, Germany

ERIC BAPTESTE • Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université
Paris 6, Paris, France

NIKO BEERENWINKEL • Department of Biosystems Science and Engineering, ETH Zurich,
Basel, Switzerland

SINA BEIER • Algorithms in Bioinformatics, Faculty of Computer Science, University of
Tübingen, Tübingen, Germany

DOMINIQUE BELHACHEMI • Life Technologies, Waltham, MA, USA

SØREN BESENBACHER • Department of Clinical Medicine (MOMA), Aarhus University,
Aarhus, Denmark

JOSEPH P. BIELAWSKI • Department of Biology, Dalhousie University, Halifax, NS, Canada;
Department of Mathematics & Statistics, Dalhousie University, Halifax, NS, Canada

RAOUL J. P. BONNAL • Istituto Nazionale Genetica Molecolare INGM Romeo ed Enrica
Invernizzi, Milan, Italy

AIDAN BUDD • Structural and Computational Biology (SCB) Unit, European Molecular
Biology Laboratory (EMBL), Heidelberg, Germany

PHILIPPE COLSON • Fondation Institut Hospitalo-Universitaire Méditerranée Infection, Pôle
des Maladies Infectieuses et Tropicales Clinique et Biologique, Fédération de
Bactériologie-Hygiène-Virologie, Centre Hospitalo-Universitaire Timone, Assistance Publique-Hôpitaux
de Marseille, Marseille, France; Unité de Recherche sur les Maladies Infectieuses et
Tropicales Emergentes (URMITE) UM63, CNRS 7278, IRD 198, INSERM U1095,
Aix-Marseille University, Marseille, France

EDUARDO COREL • Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université
Paris 6, Paris, France

MICHAEL R. CRUSOE • Common Workflow Language Project, Vilnius, Lithuania




TARCISIO MENDES DE FARIAS • University of Lausanne, Lausanne, Switzerland; SIB Swiss
Institute of Bioinformatics, Lausanne, Switzerland

JOEP DE LIGT • Department of Genetics, Center for Molecular Medicine, University Medical
Center Utrecht, Utrecht University, Utrecht, The Netherlands

CHRISTOPHE DESSIMOZ • Swiss Institute of Bioinformatics, Lausanne, Switzerland;
Department of Computational Biology, University of Lausanne, Lausanne, Switzerland;
Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland;
Department of Genetics, Evolution and Environment, University College London, London,
UK; Department of Computer Science, University College London, London, UK

COLIN N. DEWEY • Department of Biostatistics and Medical Informatics, University of
Wisconsin-Madison, Madison, WI, USA

PAOLO DI TOMMASO • Centre for Genomic Regulation (CRG), The Barcelona Institute for
Science and Technology, Barcelona, Spain

MARIO DOS REIS • School of Biological and Chemical Sciences, Queen Mary University of
London, London, UK

JULIEN Y. DUTHEIL • Department of Evolutionary Genetics, Max Planck Institute of
Evolutionary Biology, Plön, Germany

PETER EBERT • Max Planck Institute for Informatics, Saarbrücken, Saarland, Germany

SCOTT V. EDWARDS • Department of Organismic and Evolutionary Biology & Museum of
Comparative Zoology, Harvard University, Cambridge, MA, USA

CHRISTOPHER FIELDS • Institute for Genomic Biology, University of Illinois at
Urbana-Champaign, Urbana, IL, USA

SOFIA K. FORSLUND • EMBL Heidelberg, Heidelberg, Germany; Max Delbrück Centre for
Molecular Medicine, Berlin, Germany

CLÉMENTINE M. FRANÇOIS • Institut des Sciences de l’Evolution, UMR5554, Université
Montpellier II, Montpellier, France

NICOLAS GALTIER • Institut des Sciences de l’Evolution, UMR5554, Université Montpellier II,
Montpellier, France

LAURENT GAUTIER • DMAC, Center for Biological Sequence Analysis, Department of Systems
Biology, Technical University of Denmark, Kongens Lyngby, Denmark

MANUEL GIL • ZHAW Zurich University of Applied Sciences, Winterthur, Switzerland; SIB
Swiss Institute of Bioinformatics, Lausanne, Switzerland

GEORGE GITHINJI • KEMRI Wellcome Trust Research Programme, Kilifi, Kenya

SYLVAIN GLÉMIN • Institut des Sciences de l’Evolution, UMR5554, Université Montpellier II,
Montpellier, France

NATASHA M. GLOVER • Swiss Institute of Bioinformatics, Lausanne, Switzerland;
Department of Computational Biology, University of Lausanne, Lausanne, Switzerland;
Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland

JOSEFA GONZÁLEZ • Institute of Evolutionary Biology (CSIC-Universitat Pompeu Fabra),
Barcelona, Spain

ANNA GÓRSKA • Algorithms in Bioinformatics, Faculty of Computer Science, University of
Tübingen, Tübingen, Germany

VALER GOTEA • National Human Genome Research Institute, National Institutes of Health,
Bethesda, MD, USA

NAOHISA GOTO • Department of Genome Informatics, Genome Information Research
Center, Research Institute for Microbial Diseases, Osaka University, Osaka, Japan

LAIN GUIO • Institute of Evolutionary Biology (CSIC-Universitat Pompeu Fabra),
Barcelona, Spain




ASGER HOBOLTH • Bioinformatics Research Center (BiRC), Aarhus University, Aarhus,
Denmark

DANIEL H. HUSON • Algorithms in Bioinformatics, Faculty of Computer Science, University
of Tübingen, Tübingen, Germany

RITSERT C. JANSEN • Groningen Bioinformatics Centre, GBB, University of Groningen,
Groningen, Netherlands

ROEL JANSSEN • Department of Genetics, Center for Molecular Medicine, University Medical
Center Utrecht, Utrecht University, Utrecht, The Netherlands

CHRISTOPHER T. JONES • Department of Mathematics and Statistics, Dalhousie University,
Halifax, NS, Canada

MATEUSZ KADUK • Department of Biochemistry and Biophysics, Stockholm Bioinformatics
Centre, Science for Life Laboratory, Stockholm University, Solna, Sweden

SLIM KARKAR • Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université
Paris 6, Paris, France; Department of Ecology, Evolution, and Natural Resources, School of
Environmental and Biological Sciences, Rutgers, The State University of NJ, New
Brunswick, NJ, USA

TOSHIAKI KATAYAMA • Database Center for Life Science, Joint Support-Center for Data
Science Research, Research Organization of Information and Systems, Chiba, Japan

EUGENE V. KOONIN • National Center for Biotechnology Information, National Library of
Medicine, National Institutes of Health, Bethesda, MD, USA

SERGEI L. KOSAKOVSKY POND • Institute for Genomics and Evolutionary Medicine, Temple
University, Philadelphia, PA, USA

CAROLIN KOSIOL • Centre of Biological Diversity, School of Biology, University of St Andrews,
Fife, UK; Institut für Populationsgenetik, Vetmeduni Vienna, Wien, Austria

ROMAIN LANNES • Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université
Paris 6, Paris, France

PHILIPPE LEMEY • Department of Microbiology and Immunology, Rega Institute, KU Leuven,
Leuven, Belgium

MICHAEL LI • Institute for Genomics and Evolutionary Medicine, Temple University,
Philadelphia, PA, USA

LIANG LIU • Department of Statistics, University of Georgia, Athens, GA, USA

PHILIPPE LOPEZ • Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC Université
Paris 6, Paris, France

BRITTANY RIFE MAGALIS • Institute for Genomics and Evolutionary Medicine, Temple
University, Philadelphia, PA, USA

THOMAS MAILUND • Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark

IZABELA MAKAŁOWSKA • Institute of Anthropology, Adam Mickiewicz University, Poznan,
Poland

WOJCIECH MAKAŁOWSKI • Institute of Bioinformatics, University of Muenster, Muenster,
Germany

RAPHAËL MÉHEUST • Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC
Université Paris 6, Paris, France

SUPARNA MITRA • Leeds Institute of Medical Research, University of Leeds, Microbiology, Old
Medical School, Leeds General Infirmary, Leeds LS1 3EX, West Yorkshire, UK

STEFFEN MÖLLER • Institute for Biostatistics and Informatics in Medicine and Ageing
Research (IBIMA), Rostock University Medical Center, Rostock, Germany

MEGAN K. MULLIGAN • Department of Genetics, Genomics and Informatics, The University
of Tennessee Health Science Center, Memphis, TN, USA




AMIT PANDE • Institute of Bioinformatics, University of Muenster, Muenster, Germany

JANANAN S. PATHMANATHAN • Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC
Université Paris 6, Paris, France

DENNIS PEARL • Department of Statistics, Pennsylvania State University, University Park,
PA, USA

PJOTR PRINS • Department of Genetics, Center for Molecular Medicine, University Medical
Center Utrecht, Utrecht University, Utrecht, The Netherlands; Department of Genetics,
Genomics and Informatics, The University of Tennessee Health Science Center, Memphis,
TN, USA; Laboratory of Nematology, Department of Plant Science, Wageningen
University, Wageningen, The Netherlands

PERE PUIGBÒ • National Center for Biotechnology Information, National Library of
Medicine, National Institutes of Health, Bethesda, MD, USA; Division of Genetics and
Physiology, Department of Biology, University of Turku, Turku, Finland

ANDREW RAMBAUT • Institute of Evolutionary Biology, University of Edinburgh, Edinburgh,
UK

NICOLAS RODRIGUE • Department of Biology, Carleton University, Ottawa, ON, Canada;
Institute of Biochemistry, Carleton University, Ottawa, ON, Canada; School of
Mathematics and Statistics, Carleton University, Ottawa, ON, Canada

MIKKEL H. SCHIERUP • Bioinformatics Research Centre, Aarhus University, Aarhus,
Denmark

STEPHEN D. SHANK • Institute for Genomics and Evolutionary Medicine, Temple University,
Philadelphia, PA, USA

JULIANE SIEBOURG • Department of Biosystems Science and Engineering, ETH Zurich, Basel,
Switzerland

ANA CLAUDIA SIMA • ZHAW Zurich University of Applied Sciences, Winterthur, Switzerland;
University of Lausanne, Lausanne, Switzerland

GEERT SMANT • Laboratory of Nematology, Department of Plant Science, Wageningen
University, Wageningen, The Netherlands

ERIK L. L. SONNHAMMER • Department of Biochemistry and Biophysics, Stockholm
Bioinformatics Centre, Science for Life Laboratory, Stockholm University, Solna, Sweden

STEPHANIE J. SPIELMAN • Institute for Genomics and Evolutionary Medicine, Temple
University, Philadelphia, PA, USA

KURT STOCKINGER • ZHAW Zurich University of Applied Sciences, Winterthur, Switzerland

FRANCESCO STROZZI • Enterome Bioscience, Paris, France

MARC A. SUCHARD • Department of Human Genetics and Biomathematics, David Geffen
School of Medicine, University of California, Los Angeles, CA, USA

EDWARD SUSKO • Department of Mathematics and Statistics, Dalhousie University, Halifax,
NS, Canada

BJARNI J. VILHJÁLMSSON • Bioinformatics Research Centre, Aarhus University, Aarhus,
Denmark

ANDREW K. WATSON • Sorbonne Universités, Institut de Biologie Paris-Seine, UPMC
Université Paris 6, Paris, France

STEVEN WEAVER • Institute for Genomics and Evolutionary Medicine, Temple University,
Philadelphia, PA, USA

ROB W. WILLIAMS • Department of Genetics, Genomics and Informatics, The University of
Tennessee Health Science Center, Memphis, TN, USA

SCOOTER WILLIS • Department of Computer & Information Science & Engineering,
University of Florida, Gainesville, FL, USA




YURI I. WOLF • National Center for Biotechnology Information, National Library of
Medicine, National Institutes of Health, Bethesda, MD, USA

RICARDO WURMUS • BIMSB Scientific Bioinformatics Platform, Max Delbrück Center for
Molecular Medicine, Berlin, Germany

ZIHENG YANG • Department of Genetics, Evolution and Environment, University College
London, London, UK

ANDREW YATES • European Molecular Biology Laboratory, European Bioinformatics
Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK

NOOR YOUSSEF • Department of Biology, Dalhousie University, Halifax, NS, Canada


Part I


Introduction: Bioinformatician’s Primers 




Introduction to Genome Biology and Diversity 


Noor Youssef, Aidan Budd, and Joseph P. Bielawski 


Abstract 


Organisms display astonishing levels of cell and molecular diversity, including genome size, shape, and 
architecture. In this chapter, we review how the genome can be viewed as both a structural and an 
informational unit of biological diversity and explicitly define our intended meaning of genetic information. 
A brief overview of the characteristic features of bacterial, archaeal, and eukaryotic cell types and viruses sets 
the stage for a review of the differences in organization, size, and packaging strategies of their genomes. We 
include a detailed review of genetic elements found outside the primary chromosomal structures, as these 
provide insights into how genomes are sometimes viewed as incomplete informational entities. Lastly, we 
reassess the definition of the genome in light of recent advancements in our understanding of the diversity 
of genomic structures and the mechanisms by which genetic information is expressed within the cell. 
Collectively, these topics comprise a good introduction to genome biology for the newcomer to the field 
and provide a valuable reference for those developing new statistical or computational methods in genomics.
This review also prepares the reader for anticipated transformations in thinking as the field of genome 
biology progresses. 


Key words Organism diversity, Viruses, Prokaryotes, Eukaryotes, Organelles, DNA, RNA, Protein, 
Regulatory DNA, Epigenetics, Plasmids, Transcription, Translation, DNA replication, Chromatin, 
Gene structure 


1 Introduction 


Following the introduction of the concept of the genome in 1920 
[1], the field of genome science has grown to encompass a vast 
range of interconnected topics (e.g., nucleic acid chemistry, molecular
structure, replication and expression biochemistry, mutational
processes, evolutionary dynamics, and interactions with cellular 
processes). Although the notion of the genome as a fundamental 
biological unit has been with us for nearly a century, it is only within 
the last decade that genomics has emerged as a transformative 
discipline within biology and the health sciences [2]. Its rapid 
development was in large part due to advances in massively parallel 
next-generation sequencing [3], which yielded unprecedented 
levels of genomic data. Those data revealed extensive natural 
variation in the way that genomes are structured and processed.
This led modern biologists to reevaluate the fundamental definition 
of the genome. 

The typical definition of the genome is often dualistic, referencing
both structural features and its function to store and transmit
biological information [4]. For example, the US National Institutes 
of Health (NIH) uses the following definition: “A genome is an 
organism’s complete set of DNA, including all of its genes. Each 
genome contains all of the information needed to build and maintain
that organism. In humans, a copy of the entire genome—more
than three billion DNA base pairs—is contained in all cells that have 
a nucleus.” This conception, as with many others, is structural with 
regard to physical features (viz., genes and DNA base pairs) and 
informational with regard to its role in carrying out cellular functions
(viz., to build and maintain the organism). Through increased
knowledge of genome diversity, the field has come to realize that 
both conceptions of the genome are sometimes insufficient [4]. We 
now understand that the physical structures of the genome can be 
transient and that the expression of information contained within a 
genome is often conditioned on non-genomic factors. The science 
of genome biology is entering a new era based on a deeper understanding
of the relationship between genotype and phenotype [5].

The purpose of this review is to provide a condensed overview 
of genome biology and to anticipate transformations in thinking 
that will occur as the field progresses. The remainder of this article 
is structured into four parts, with the next section providing a brief 
overview of the diversity of organismal cell types. The two 
subsequent sections introduce the structural and informational 
aspects of genomes, respectively. In the final section, we reassess 
the definition of the genome through selected biological examples 
and conclude with an updated perspective on the nature of the 
genome as an informational entity. 


2 Organism Diversity and Cell Types 


Cells are the smallest living unit of an organism. All cells have three 
attributes in common: cell membrane, cytoplasm, and genome. 
Structurally, cells can be divided into two basic types: prokaryotic 
and eukaryotic cells. Eukaryotic cells tend to be more complex. 
They possess a nucleus and other membrane-bound organelles, 
which are specialized components in the cell that perform unique 
functions (e.g., nucleus, mitochondria, plastids). Conversely, prokaryotic
cells lack membrane-bound organelles. Although similar in
cell structure, prokaryotes include two fundamentally distinct 
domains: the eubacteria (true bacteria, often referred to simply as 
bacteria) and the archaea. 


2.1 Viruses 


2.2 Bacteria 
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Cellular life is detected in almost every environment on Earth. 
As life has colonized and adapted to the vast number of niches, cells 
have evolved an incredible amount of diversity in regard to size [6], 
form [7], lifestyle [8, 9], and complexity [10]. Understanding the 
basis of such diversity remains one of the central aims of biology. 
Readers interested in the latest understanding of Earth’s biodiver- 
sity, the unique characteristics ofits organisms, and how both extant 
and extinct forms are related to each other are encouraged to explore 
the following resources: the University of California Museum 
of Paleontology “History of life through time” exhibit [11], 
the Tree of Life Web Project [12], the Encyclopedia of Life [13]. 


Viruses are infectious agents of living cells that are unable to repro- 
duce in the absence of a host. Viruses are not considered cellular 
entities since they lack two of the essential attributes that define a 
cell; they possess neither a cell membrane nor cytoplasm. The 
discovery of virophages, viruses that parasitize other viruses, resur- 
rected the debate on their classification as living organisms 
[14]. Some consider viruses to be living entities since they can be 
hosts to other viruses, with a virophage infection leading to the 
eventual death of the host virus, implying an initial “living” state 
[15]. The opposing view asserts that a virus’ inability to reproduce 
outside of a cellular host makes them nonliving entities 
[16, 17]. Irrespective of their delineation as living or nonliving, 
viruses are relevant to this review as they possess genomes and are 
the most abundant biological replicators in the biosphere [18]. 

Outside of their host, viruses exist as viral particles (virions) 
consisting of a protein capsule that protects and encloses their 
genome. Once a virion has entered a host cell, it “hijacks” the 
host’s cellular structures and processes to carry out the metaboli- 
cally active phase of the viral life cycle. At this stage, the virus 
exhibits physiological properties reminiscent of living cells; they 
metabolize, grow, and reproduce. There is a wide range of viral 
lifestyles, with corresponding diversity in viral forms, sizes, hosts, 
and genomes [16]. The largest known virus, the mimivirus, was 
originally identified as an infectious agent of an amoeba [19] and 
can itself become a host for virophages [14]. To put this in context, 
the virion of a mimivirus can be larger than some prokaryotic cells 
[16]. At the other end of the scale are viruses such as the circo- 
viruses, some of which have small genomes made up of fewer than 
2000 nucleotides [20]. A more detailed account of viral diversity 
can be found at the ViralZone website [21]. 


The bacterial cell is prokaryotic, and it is relatively simple as com- 
pared to eukaryotic cells. It has no membrane-bound organelles, 
and the chromosome (usually one) is not separated from the other 
components of the cell. While predominantly unicellular, bacteria often 
live in biofilms, a community of cells bound together by a secreted 
polymer matrix [22], displaying a range of cooperative behaviors 
[23]. They can also exhibit regulated differentiation into different 
cell types, where two cells with the same genome have different 
morphology and function [22, 24]. 

Only a very small fraction of bacterial diversity (less than 1%) 
can be cultured and grown in the laboratory [25]. The problem of 
uncultivable bacteria is a consequence of our limited knowledge of 
their physiological diversity and the interactions necessary for their 
growth [26]. To this end, efforts are being made to study bacteria 
in nature [27-29] but with limited progress given the immense 
metabolic diversity of bacteria. Even within the incomplete sam- 
pling of cultivable bacteria, there is considerable diversity in cell 
shape [30], mode of reproduction [9], and cell cycle 
regulation [31]. 

The bacterial cell cycle involves the coordination of genome 
replication and segregation of replicated copies into daughter 
cells, followed by cell division. In this way, the transmission of 
genetic material is “vertical” from one cell generation to the next. 
Under certain conditions, some bacteria, such as E. coli, can initiate 
a new round of genome replication prior to completion of cell 
division [32, 33], thereby resulting in an increase in the number 
of gene copies near the origin of replication as compared to loci 
replicated later [31]. Other bacteria, such as Caulobacter, maintain 
a tightly regulated cell cycle to ensure a single replication event per 
division [34]. Under optimal conditions, some species can com- 
plete their cell cycle every 20 min, implying that a single cell could 
produce more than a billion descendants in a mere 10 h. In addition 
to vertical transfer, genetic information can be transferred “hori- 
zontally” between unrelated cells via the processes of transforma- 
tion, conjugation, or transduction [35]. An event that transfers 
gene(s) between different species (or cells) by any of these three 
processes is referred to as a horizontal gene transfer (HGT) event. 


2.3 Archaea 


Archaea are single-celled organisms that appear strikingly similar to 
bacteria under light and electron microscopes. Like bacteria they 
often have a single circular chromosome and lack a nucleus, and for 
a long period of time the archaea were wrongly categorized as 
bacteria. The first indication that the archaea might be a separate 
domain of life was obtained from phylogenetic analyses of the 16S 
rRNA gene [36]. Advancements in genome sequencing and analy- 
sis yielded further evidence of the evolutionary distinction between 
the bacterial and archaeal domains [37]. Despite their superficial 
cellular similarity to bacteria, the archaea have many molecular-level 
similarities to eukaryotes, leading researchers to hypothesize that 
the ancestor of the eukaryotes arose within the archaea [38]. 
Previously, archaea were assumed to be a minor group of 
organisms inhabiting extreme environments beyond the tolerance 
of bacteria (salt brines, hydrothermal vents, acidic and anoxic 
conditions, etc.). Through culture-independent methods, archaea 
were discovered to be much more widespread and metabolically 
diverse. Archaea are now known to inhabit the human gut, and 
through mutualistic community relationships, they play a key role 
in human health and metabolism [39-41]. There is increasing 
evidence for archaea playing a significant role in global nutrient 
cycling [42]. They contribute major mechanisms for anaerobic 
methane oxidation [42], ammonia oxidation [43], and other 
parts of the nitrogen cycle including nitrogen fixation [44]. The 
archaea also appear to be ecologically competitive with bacteria, as 
they make significant contributions to the microbial communities 
of non-extreme soil, aquatic, and marine environments 
[43, 45]. Although they can be highly abundant in such environ- 
ments, archaeal diversity is greatest in the more extreme 
habitats [45]. 

Archaea possess an array of bacteria-like, eukaryote-like, and 
archaea-specific features. The archaeal cell wall is chemically and 
structurally diverse, yet archaea systematically lack a cell wall peptido- 
glycan, murein, that is ubiquitous among the bacteria 
[46, 47]. Their membrane lipids are chemically different from 
those found in either bacteria or eukaryotes [48], and they possess 
many novel enzymes that are required for the biosynthesis of their 
unique membranes [49, 50]. Consequently, most archeoviruses are 
unique to archaea [51]. Even structural appendages that initially 
appeared to be homologous to bacterial appendages are often 
structurally distinct and have a different genetic basis than their bacte- 
rial counterparts [52-54]. At the biochemical level, the archaea use 
many sources of energy and are metabolically diverse, probably 
more so than either bacteria or eukaryotes [55]. 


All complex multicellular organisms are eukaryotes (animals, 
plants, fungi, red algae, and brown algae), as are many unicellular 
organisms [56, 57]. Eukaryotic cells are found in a wide diversity of 
sizes and shapes [58, 59]. They are generally larger and have a more 
complex internal organization than the bacteria and archaea. A key 
characteristic of the eukaryotic intracellular organization is the use 
of lipid membranes to separate their contents into different com- 
partments [60, 61]. The bulk of the eukaryotic genetic material is 
surrounded by a nuclear envelope and is thus maintained in a 
separate organelle, the nucleus. This provides a fundamental per- 
spective on how eukaryotic cells differ from bacterial and archaeal 
cells and has important consequences on the expression of eukary- 
otic genetic information. 

In addition to the nucleus, other organelles (mitochondria and 
plastids) contain small genomes that encode additional genes. Both 
mitochondria and plastids originated from ancient endosymbiosis 
events between ancestral eukaryotic cells and bacterial organisms. 
Following these events, the invading bacteria underwent a process 
of genome reduction in which they transitioned from autonomous 
organisms to cell-dependent organelles [62]. 

Despite our familiarity with plants, animals, and fungi, the vast 
majority of eukaryotic diversity lies outside of those groups and is 
largely microbial [63]. These “other” eukaryotes are collectively 
called protists. They do not form a monophyletic group, i.e., pro- 
tists do not form a phylogenetic group that comprises a shared 
common ancestor and all of its descendants [57, 64]. The term 
protist is used largely for convenience to classify all eukaryotes that 
are not plants, animals, or fungi. Protists embody extensive ecolog- 
ical and structural diversity and include several important groups of 
unicellular eukaryotes involved in human diseases [65]. For exam- 
ple, the unicellular apicomplexan eukaryote Plasmodium is the 
causative agent of malaria, which affects around 10% of the world 
population [65]. More positively, protist species are important 
primary producers and are an essential link in the ocean’s biogeo- 
chemical cycles [66]. 
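
The definition of a monophyletic group given above can be made concrete with a small sketch. The tree below is a hypothetical toy topology chosen only for illustration: the taxon labels `protist_A` to `protist_C` and the branching order are assumptions, not a published phylogeny, and `is_monophyletic` is an illustrative helper rather than a function from any library.

```python
# Toy rooted tree as nested tuples; leaves are strings.
# Hypothetical topology, for illustration only.
TOY_TREE = (
    "protist_A",
    (
        ("plants", "protist_B"),
        (("animals", "fungi"), "protist_C"),
    ),
)


def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some node's descendant leaves are exactly `group`.

    A group is monophyletic when it contains a common ancestor and
    all of that ancestor's descendants, i.e. it is a complete clade.
    """
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


protists = {"protist_A", "protist_B", "protist_C"}
print(is_monophyletic(TOY_TREE, {"animals", "fungi"}))  # True
print(is_monophyletic(TOY_TREE, protists))              # False
```

In this toy tree the three protist lineages are scattered among the other groups, so no single ancestor has exactly those three taxa as its descendants, which is the sense in which "protist" is a label of convenience rather than a clade.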


3 Genome Structure and Organization 


The notion of the gene as the physical carrier of hereditary infor- 
mation existed years before its physical and chemical structures 
were known. In 1902, Sutton provided the first clear support for 
the chromosomal theory of inheritance, allocating genes to seg- 
ments on chromosomes [67]. The modern view of the gene is more 
often focused on a particular chemical sequence of nucleic acids 
rather than a chromosomal locus, but the two are not independent. 
The genetic instructions encoded within an organism’s nucleic acid 
molecules comprise the organism’s genotype. The physical manifes- 
tation of such genetic information, which will depend on environ- 
mental interactions, comprises the organism’s phenotype. 

There are two types of nucleic acids: deoxyribonucleic acid 
(DNA) and ribonucleic acid (RNA). Both are polymers consisting 
of chains of nucleotides. Each nucleotide includes three compo- 
nents: a 5-carbon sugar, a phosphate group, and a nitrogenous 
base. A nitrogenous base together with the sugar (without the 
phosphate group) is called a nucleoside. The sugar component in 
RNA, ribose, is a normal sugar with one hydroxyl group 
(OH) attached to each carbon atom. Deoxyribose, the sugar pres- 
ent in DNA, differs only in the absence of one oxygen atom at the 
2’ carbon atom (H instead of OH). This chemical difference is 
crucial for enabling enzymes to distinguish between RNA and 
DNA polymers. The 5’ sugar carbon carries a phosphate group 
and is referred to as the 5’ end of the polynucleotide molecule 
(DNA or RNA). The 3’ end has a free hydroxyl (OH) group that 
is available to form chemical bonds with other atoms. As a result, 
synthesis of DNA and RNA in the cell proceeds through the 
addition of a nucleotide to a 3’ terminal hydroxyl group. The 
polynucleotides, therefore, exhibit directionality, and synthesis 
occurs in a 5’ to 3’ direction. 

All living cells employ the double helical structure of DNA as a 
chemical means to store information. Each of the two longitudinal 
strands is an alternating sequence of phosphate and a 5-carbon 
sugar. At each sugar, the two strands are bridged by two nitroge- 
nous bases, one purine molecule (of type adenine [A] or guanine 
[G]) and the other a pyrimidine molecule (of type cytosine [C], 
thymine [T], or uracil [U]). The chemical bridges between purine 
and pyrimidine molecules (called base pairs) are held together by 
hydrogen bonds. Each purine can be complemented by only one 
pyrimidine: A forms two hydrogen bonds with T (or U in RNA) 
and C forms three hydrogen bonds with G. These are referred to as 
the canonical or Watson-Crick pairings. Given this pairing pattern, 
the sequences of the double-stranded DNA are said to be comple- 
mentary, and the sequence of one strand can be deduced from the 
sequence of its complementary strand. The order of the nitroge- 
nous bases in DNA (or RNA) is what confers the meaning of the 
information encoded in the genome. 
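
Because the two strands are complementary and run in opposite directions, the sequence of one strand determines the other. A minimal Python sketch of this deduction follows; the function name `reverse_complement` and the example sequence are illustrative assumptions, and only the canonical A-T and C-G pairings described above are handled.

```python
# Canonical Watson-Crick pairings for DNA: A pairs with T, C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand):
    """Return the complementary strand, written 5' to 3'.

    The two strands of a double helix run in opposite directions,
    so the complement is read in reverse to keep the 5'-to-3' convention.
    """
    return "".join(PAIRING[base] for base in reversed(strand.upper()))


example = "ATGCGT"  # hypothetical sequence, written 5' to 3'
print(reverse_complement(example))  # ACGCAT
```

Applying the function twice returns the original sequence, which is one way to see that either strand carries the full information of the duplex.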

A vital feature of genetic information is its ability to be repli- 
cated and passed on to daughter cells. The core mechanisms that 
copy DNA are conserved in all three domains of cellular life: 
bacteria, archaea, and eukaryotes [68]. Accurate DNA replication 
is essential to produce viable offspring—too many alterations in the 
DNA impede the production of functional proteins, thereby 
increasing the chances of nonviable progeny. Therefore, most 
DNA replicates with high fidelity. However, mistakes do occur. In 
humans, on average one error occurs in 30 million bases copied per 
cell division [69]. Cells that carry the resulting altered genes are 
called mutants. 

Although all living things carry DNA, the processes through 
which genetic information is physically transferred from DNA to 
RNA (called transcription) and then used to create a polypeptide 
molecule with a unique sequence of amino acids (called transla- 
tion) differ between domains of life. The lack of membrane-bound 
nuclei in prokaryotes permits the simultaneous occurrence of tran- 
scription and translation [70]. In eukaryotes, those processes are 
separated by the nuclear membrane; DNA is first transcribed to 
RNA in the nucleus, and the RNA product is subsequently trans- 
lated to an amino acid sequence in the cytoplasm, ultimately lead- 
ing to the construction of a protein. 
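
The two-step flow from DNA to RNA to polypeptide can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it uses the standard genetic code (built here from the conventional T, C, A, G ordering), treats the input as the coding strand so that transcription reduces to replacing T with U, and ignores regulation, splicing, and start-site selection. The function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order
# for each codon position; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon).replace("T", "U"): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """Return the mRNA matching a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA codon by codon until a stop codon is reached."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


dna = "ATGGCCAAATGA"  # hypothetical coding sequence
mrna = transcribe(dna)
print(mrna)             # AUGGCCAAAUGA
print(translate(mrna))  # MAK
```

The sketch treats the two steps as separate function calls, which mirrors the eukaryotic arrangement; in prokaryotes the same two operations can overlap in time because no nuclear membrane separates them.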

Organisms from all domains of life, and many of the viruses that 
parasitize them, have a very large genome compared with the size of 
the cell or compartment to which it is confined. For instance, the 
human nuclear DNA consists of approximately three billion base 
pairs; when stretched out, it amounts to about 2 m of total DNA 
per cell. The average human cell size is merely 10 µm. The 
impressive ability to store DNA within the cell is possible through a 
process of genome packaging. In eukaryotes and some archaea, the 
DNA wraps around histone proteins to form nucleosomes. In 
humans, this results in a two-million-fold decrease in size, allowing 
the DNA to compact into the nucleus [68]. Prokaryotic DNA 
compaction is achieved using a combination of supercoiling, mac- 
romolecular crowding, and association with DNA-binding proteins 
[71]. The degree of supercoiling used in prokaryotes varies 
considerably between different species. 

Prokaryotic cells tend to have efficient genomes, with most of 
their genetic material composed of protein-coding regions. 
Archaeal genomes are, on average, more compact than bacterial 
genomes [72]. An increase in prokaryotic genome size is therefore 
often accompanied by an increase in the number of genes encoded. 
This trend is not evident in eukaryotes, for which there is little 
association between genome size and the number of protein- 
coding genes [73]. Consider the E. coli genome: more than 90% 
of its DNA encodes proteins. This is in stark contrast with the 
modest 2% protein-coding regions present in human DNA 
[74]. Most eukaryotic genomes are riddled with non-protein-cod- 
ing regions (see Subheading 4.2 for an evolutionary mechanism). 
This results in them having larger genome sizes on average than 
prokaryotes [74]. 


3.1 Viral Genomes 


Viruses use any combination of either RNA or DNA, either single- 
or double-stranded molecules, in either circular or linear forms, to 
encode their genetic instructions [75, 76]. The viral genetic mate- 
rial is typically referred to as segments rather than chromosomes. 
Viral genomes composed of multiple segments are referred to as 
segmented. When different strains of the same segmented viral 
species infect a cell, genomes from the different strains can mix to 
produce hybrids—a process known as reassortment. Hybrid flus 
such as the H1N1 swine influenza A virus originated in this 
way [77]. 
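
The combinatorics of reassortment can be illustrated with a short sketch. It assumes a virus with eight genome segments (as in influenza A) and two co-infecting parental strains labelled `A` and `B`; the labels and the function name are hypothetical, and the sketch only enumerates possible segment combinations without modelling which of them would be viable.

```python
from itertools import product


def reassortants(n_segments, parents=("A", "B")):
    """List every way to draw each segment from one of the parental strains."""
    return ["".join(combo) for combo in product(parents, repeat=n_segments)]


# Assumption: eight segments, as in influenza A viruses.
genotypes = reassortants(8)
# Exclude the two pure parental genotypes to count true hybrids.
hybrids = [g for g in genotypes if len(set(g)) > 1]

print(len(genotypes))  # 256
print(len(hybrids))    # 254
print(hybrids[:3])     # ['AAAAAAAB', 'AAAAAABA', 'AAAAAABB']
```

Each character in a genotype string records which parent contributed that segment, so a single co-infection can in principle yield many distinct hybrid genomes.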

Viral strains package their genomes in various ways. Most DNA 
and RNA viruses with small genomes (<20 kb) employ energy- 
independent packaging systems where capsid assembly and genome 
condensation are coupled. One example is the RNA genome of the 
HIV retrovirus that, in the mature virion, forms an RNA-protein 
complex with one of the cleavage products of the Gag polyprotein 
[78]. Other viruses, such as the lambda bacteriophage, require ATP 
to pump their genome directly into a preassembled capsid 
[79]. The latter type of machinery is ubiquitous in bacterial viruses. 
Alternatively, large viruses package their genome using histone-like 
proteins that are critical for eukaryotic genome packaging [80]. For 
a review on genome packaging in viruses, see ref. 81. 


3.2 Bacterial Genomes 


Despite not being confined within a membrane-bound compart- 
ment, the prokaryotic genome is unevenly distributed 
throughout the cell. It often clusters in an irregularly shaped vis- 
cous region known as the nucleoid that makes up about a quarter of 
the intracellular volume [82]. The organization and distribution of 
the nucleoid are dynamic and dependent on the growth rate and 
presence of antibiotics [83]. 

It was previously thought that all bacterial cells possessed a 
single circular chromosome. In 1989, the first linear bacterial chro- 
mosome was discovered in the spirochete Borrelia burgdorferi, the 
causative agent of Lyme disease [84, 85]. Additionally, recent 
advancements have revealed that many cells retain multiple circular 
or linear chromosomes [86]. These often consist of a primary 
chromosome, which is larger and harbors a higher density of essential 
genes compared to the secondary chromosome(s) [87]. 

The replication of bacterial DNA initiates at a well-defined 
sequence, called the origin of replication. The proteins involved in 
replication bind to the origin site and DNA synthesis proceeds in 
both directions. Circular chromosomes require a single origin, and 
replication is terminated either by a stop signal or when the two 
replication forks meet [88]. Linear bacterial chromosomes typically 
have a central origin, and replication proceeds bidirectionally much 
as in circular chromosomes. However, replication enzymes are 
unable to synthesize new DNA at the ends of a linear chromosome, 
and this results in the gradual shortening of DNA after each repli- 
cation event [89]. Linear chromosomes, therefore, require terminal 
structures known as telomeres to protect against DNA degradation. 
Telomeres are characterized by the presence of multiple tandem 
repeats of short noncoding nucleotide sequences. 

Linear prokaryotic chromosomes have evolved two different 
types of telomeres [90]. The first, best understood in the strepto- 
mycetes, uses a terminal protein complex covalently attached to the 
5’ end of the DNA molecules. During replication, DNA polymerase 
binds the first synthesized nucleotide directly to the terminal pro- 
tein. This replication strategy allows for the complete duplication of 
the linear molecule with no loss of genetic information [91]. The 
second type, best studied in the spirochetes, involves the formation 
of closed hairpin structures at the termini [92]. Replication of the 
linear DNA proceeds as expected. Once duplication of each DNA 
strand is completed, the newly synthesized DNA molecules are still 
temporarily attached, forming a structure superficially resembling a 
circular chromosome. A specific enzyme is then recruited to separate the 
two linear strands and re-form the telomeres [93]. For an overview 
of telomeric structures, see ref. 94. 


3.3 Archaeal Genomes 


Archaeal genomes share features with both bacteria and eukaryotes. 
Archaea typically possess circular chromosomes reminiscent of bac- 
terial genomes; some have a single chromosome and a single origin 
of replication, while other species have multiple chromosomes and 
multiple origins on each [95, 96]. Given that archaea have the 
prokaryote cell type that lacks membrane-bound organelles (and 
hence nuclei), they are similar to bacteria in permitting the simul- 
taneous occurrence of transcription and translation. Nonetheless, 
there are fundamental differences from the bacteria in the proces- 
sing of genomic information. The initiation of polypeptide synthesis 
in archaea more closely resembles that used in the eukaryotic trans- 
lation process. Additionally, the core archaeal transcription 
machineries are more closely related to those of eukaryotes [97, 98]. 
Archaeal and eukaryotic DNA replication and repair systems have 
also been shown to have many features in common [99]. 

Relatively little is known about the structure of archaeal gen- 
omes [100], but some are packaged into chromatin via histone 
proteins. Chromatin is a compact and organized chromosome 
structure that consists of DNA in close association with proteins. 
Interestingly, this form of chromatin is present in all eukaryotes and 
missing from bacteria [101]. Among the archaea that use histones 
(i.e., Thermoproteales and Euryarchaea), the geometry of their 
histone-mediated chromatin is the same as in eukaryotes [102]. 
However, archaeal histones are often shorter than the eukaryotic 
histones [101]. Groups of archaea that lack histones (e.g., Cre- 
narchaea) encode other DNA-binding proteins associated with 
the architecture of bacterial chromatin [100]. Another family of 
DNA-binding proteins called Alba (acetylation lowers binding 
affinity) is ubiquitous among archaea. They are abundant small 
proteins that facilitate genome compaction, play a key role in 
determining the architecture of archaeal chromatin, and regulate 
gene expression on a genomic scale [101]. Alba proteins have been 
detected in both histone-lacking and histone-containing 
archaea [103]. 


3.4 Eukaryotic Genomes 


Eukaryotes sequester their linear chromosomes within a 
membrane-bound nucleus. Linear eukaryotic chromosomes have 
three essential structural elements: a centromere, a pair of telo- 
meres, and origins of replication. The centromere is the attachment 
point for spindle microtubules—the filaments responsible for phys- 
ically moving chromosomes during cell division. Telomeres are the 
protective ends of a linear chromosome. The origins of replication 
are the sites where DNA synthesis begins. Eukaryotes typically have 
multiple linear chromosomes, each with many origins of replica- 
tion. The larger genome size and slower replication machinery in 
eukaryotes necessitate multiple origins to speed up the 
replication process. 

In eukaryotic cells, nuclear DNA compaction involves the asso- 
ciation of DNA with the protein products of a family of genes, the 
histones, whose sequence variants provide for a variety of different 
functions. The eukaryotic chromosome is organized at the lowest 
level by wrapping the DNA around histones, forming nucleosomes. 
This structure constitutes the basic unit of the chromatin fiber, 
which is further organized into higher-order structures mediated 
by other proteins [104, 105]. Sequence variation in histones, in 
combination with posttranslational modification of the protein, 
affects the structural properties of chromosomal nucleosomes and 
gene expression. 

Eukaryotic DNA consists of at least three types of sequences: 
unique-sequence DNA, moderately repetitive DNA, and highly 
repetitive DNA. Unique-sequence DNA consists of regions that are present 
only once or at most a few times in the genome. Most protein- 
coding regions fall within this category. In contrast, more than 
half of the total DNA in all eukaryotic genomes is made up of 
repeated sequence motifs that are either moderately or highly 
repetitive [106]. Moderately repetitive DNA consists of sequences from 
160 to 180 base pairs (bp) in length that are repeated thousands 
of times [106]. Some of these sequences perform important func- 
tions for the cell, such as coding for types of RNA [107]. Highly 
repetitive DNA consists of short sequences, less than 60 bp, that are present 
in hundreds of thousands of copies repeated throughout the 
genome. Repeats that are 2-10 bp are known as microsatellites, 
whereas motifs that are 10-60 bp are termed minisatellites [108]. 

Most of the repetitive sequences arise through transposition 
(see Subheading 4.2). The repeated sequences can be found either 
in tandem arrays, i.e., appearing adjacent to each other, or inter- 
spersed throughout the genome. The evolution and maintenance 
of nonfunctional repeated sequences have spurred the interest of 
genome scientists, with some classifying these motifs as selfish genes 
that reproduce to propagate themselves and provide no positive 
contribution to the organism’s phenotype or fitness [106]. Repeats 
also represent technical challenges for bioinformaticians developing 
software for sequence alignment and genome assembly. From a 
computational perspective, repeats create ambiguities that are chal- 
lenging to resolve. For a review on computational challenges and 
solutions, see ref. 108. 


3.5 Auxiliary DNA Structures 


Both prokaryotes and eukaryotes have secondary chromosomal 
structures. For eukaryotes, this refers to any form of DNA found 
outside of a nucleus—although the discovery of microDNA 
extends this classification [109]. Eukaryotic auxiliary DNA often 
contains essential genes that are necessary for normal cell produc- 
tion. For example, the DNA chromosome located within the mito- 
chondrial organelle encodes genes that are involved in oxidative 
phosphorylation and the creation of different types of RNA 
[110]. For prokaryotes, auxiliary DNA refers to any DNA that is 
not associated with the primary chromosome, and unlike eukar- 
yotes, the genes encoded in such DNA are often dispensable. For 
example, small circular chromosomes, called plasmids, often 
contain genes that allow the bacterium to survive various environ- 
mental conditions; however, they are not usually essential for nor- 
mal cell function [110]. 


3.5.1 Mitochondrial DNA 


The mitochondrion is a double membrane-bound organelle that is 
ubiquitous in eukaryotic cells. There is only one known case of a 
eukaryotic cell able to survive without a mitochondrion 
[111]. Mitochondria are essential because they are the site of 
production for most of the cell’s energy, which is produced as 
ATP by the oxidative phosphorylation metabolic pathway. Addi- 
tionally, the mitochondrion is the site of iron-sulfur (Fe/S) cluster 
assembly. Fe/S clusters are protein cofactors that are essential for 
various extramitochondrial pathways [112]. The mitochondria- 
lacking eukaryote, a species of Monocercomonoides, is unique in 
that it lives only within the intestine of the chinchilla and has 
evolved different strategies for Fe/S cluster formation and obtain- 
ing energy absorbed from its environment [111]. 

Mitochondria are the derivatives of prokaryotic cells that were 
engulfed by a common ancestor of all eukaryotes. The DNA within 
these organelles is the remnant of the DNA genome of the 
ancestral prokaryotic endosymbiont. Thus, the mitochondrial 
DNA (mtDNA) more closely resembles a prokaryotic genome. 
For example, in most animals and fungi, mtDNA consists of a single 
circular chromosome. However, small linear mtDNA chromo- 
somes with defined telomeres have been identified within various 
protists, animals, and fungi [113, 114]. Additionally, the architec- 
ture of mtDNA is not determined by histones but instead by a set of 
small DNA-binding proteins that induce structures analogous to 
the bacterial chromatin. Mitochondrial genomes have been cate- 
gorized into six different types depending on shape, size, structure, 
and number (see ref. 115). 

In humans, the mitochondrial genome encodes 13 of the 
80 proteins that are directly involved in oxidative phosphorylation. 
The remaining proteins are encoded in the nuclear chromosomes 
[110]. The exact contribution from mitochondrial and nuclear 
genomes varies across eukaryotes. Nonetheless, in the vast majority 
of known eukaryotic species, the mtDNA is essential to produce 
important proteins involved in energy production, demanding that 
all cells have faithfully inherited the mtDNA. 


3.5.2 Plastid DNA 


Plastids are similarly derived from an endosymbiosis with a bacte- 
rium, with the organelle retaining remnants of that ancestral bacte- 
rial genome. Like the mitochondrion, the plastid is a double 
membrane-bound cytoplasmic organelle. Unlike the mitochon- 
drion, plastids often contain pigment used in photosynthesis. Plas- 
tids are found in the cytoplasm of protists and all higher plants. 
Plastid DNA (ptDNA) is highly reduced relative to the genomes of 
extant photosynthetic bacteria. In part, the reduction in genome 
size is due to gene loss with some regions excised and incorporated 
into the host nuclear DNA [116]. The ptDNA encodes important 
proteins that are essential for cell viability [117]. Almost all plastids 
have circular DNA, with the alveolate Chromera velia being the 
single known case of linear ptDNA. The linear extrachromosomal 
ptDNA has a telomere arrangement resembling those of linear 
mtDNA [117, 118]. 

Genes encoded in ptDNA are involved in the synthesis and 
storage of various cellular components, including those necessary 
for photosynthesis. Plastids have diverged to carry out different 
functions with multiple types identified. For example, chloroplasts 
are specialized for carrying out photosynthesis; chromoplasts con- 
tain pigments that provide petal colors, whereas amyloplasts are 
used for bulk storage of starch [117]. 


A nucleomorph is a vestigial eukaryotic nucleus found in crypto- 
monads and chlorarachniophytes, which are both plastid- 
containing algae. The nucleomorph is located in these organisms 
between the inner and outer membranes of the plastid and is 
believed to be derived from the nucleus of an endosymbiotic algal 
cell engulfed by a larger eukaryotic cell [119]. Thus, the plastid 
organelle in this case evolved from two endosymbiotic events: a 
prokaryote was engulfed by a eukaryote which thereby became 
photoautotrophic and that cell was then engulfed by another 
eukaryote. The nucleomorph genomes are extremely small com- 
pared to the typical nuclear genome, being comprised of mostly 
single-copy housekeeping genes and having no mobile elements. 
The nucleomorph genome of the cryptomonads suggests that it 
was derived from a red algal ancestor, whereas the nucleomorph 
genome of the chlorarachniophytes suggests a green algal 
ancestor [119]. 


Plasmids are present in bacteria, archaea, and eukaryotes [120]. 
Most plasmids are circular, although linear plasmids have been 
identified [121]. The genes carried on plasmids tend to be asso- 
ciated with functions that enable or enhance survival and growth 
under specific conditions. They can be horizontally transferred 
between prokaryotic cells and represent an important vehicle for 
sharing genetic information [122]. For example, a plasmid that has 
evolved an antibiotic resistance gene(s) can be transferred to neigh- 
boring bacteria promoting their rapid adaptation to various stresses 
associated with an antibiotic environment. 

The eubacterium E. coli is estimated to have more than 270 plas- 
mids having different distributions among and within cells; some 
promote mating, while others contain genes that kill other bacteria. 
The number of plasmids known and sequenced is much higher in 
bacteria as compared to archaea, with the lowest number having 
been identified in eukaryotes [122]. In recent years, plasmids have 
been used extensively in genetic engineering as a means of introdu- 
cing and modifying target genes [122, 123]. 


3.5.5 MicroDNA 


In 2012, Shibata et al. discovered a new form of extrachromosomal 
DNA in eukaryotes, called microDNA [122]. In contrast with 
other auxiliary DNA, microDNA is derived from non-repetitive 
sequences that are often associated with functional genes. They 
are circular DNA between 200 and 400 bp and are found in the 
nuclei of mammalian cells [122]. microDNA is thought to be 
associated with the repair and maintenance processes of nuclear 
DNA. It is not yet clear if microDNA plays a functional role in 
these processes or if it is merely an unavoidable by-product. For 
the time being, detection of specific microDNA is being proposed 
as a screening measure to aid the successful eradication of tumors in 
humans and as a potential method for cancer diagnosis and 
prognosis [124]. 


4 Genomic Storage and Processing of Information 


It was not possible to understand how hereditary information was 
encoded and transmitted across generations without first having 
knowledge of the structure of DNA. Knowledge of DNA structure 
led to a structure-oriented conception of genomes as linear 
sequences of ordered nucleotides. Once protein synthesis was 
linked to gene sequences, the structural view of the genome 
began to be supplanted by the informational view [125]. Genetic 
information was initially viewed as a static property belonging to 
the specific sequence of ordered subunits. However, others have 
argued that the static view of information is not satisfactory (e.g., 
[4, 125]). Barbieri [125] contends that “it is only when a sequence 
provides a guideline to a copymaker that it becomes information for 
it. It is only an act of copying, in other words, that brings organic 
information into existence.” Based on Barbieri’s viewpoint, infor- 
mation is not always a property of a specific structure (e.g., DNA or 
RNA); rather his view is that such molecules are information rele- 
vant only when they are used to perform a biological function. A 
DNA sequence, for example, is said to have information if it is 
transcribed or interacts with a protein in a biologically relevant 
way. Similarly, an mRNA transcript also encodes information as it 
is translated into a protein. Also then, a protein could be viewed as 
an informational entity in the sense that it is necessary to carry out a
biological function. Therefore, under this new conception, as well
as the static view, it is clear that biological information can be
manifest in different biological molecules, an observation that has
complicated the notion of the genome as the fundamental unit of
biological information [4].

We now understand that storage of the genetic information
required to sustain life does not need to be restricted to biological
molecules. This was vividly illustrated in the laboratory when a
bacterial genome was chemically sequenced, its information stored
within a computer (a completely different medium composed of
binary states), then resynthesized in the form of a new DNA
chromosome, and that synthetic DNA ultimately used as the sole
means to maintain a living cell [126]. Although the information
required for life can be stored independently of the chemical structure
of the DNA, it cannot be expressed in a biologically useful
form without various proteins and RNA molecules. Thus, expression
of information encoded within a genome (bringing that information
into existence) is contingent on its cellular context. In this
section, we examine different ways in which information may be
contained within a genome and mechanisms that result in biologically
useful expression of that information.


4.1 Gene Expression

Mere knowledge of the DNA sequence of a genome is often insuf- 
ficient to predict phenotype. The amount and timing of gene 
expression play a key role. For example, human cells with a nucleus 
have copies of almost identical DNA sequences. Yet cells perform 
varying functions, and they organize to create the multiple organs 
that constitute the human body. Cells achieve this primarily by 
differentially regulating the rate of transcription and/or translation 
of genes. 

DNA transcription and protein translation comprise elemen- 
tary levels of information transfer from genotype to phenotype. 
Maintaining control of these processes is fundamental for all organ- 
isms. Genetic elements involved in regulating gene expression are 
referred to as regulatory elements. They often represent sequences 
found on the DNA or RNA. In this way, regulatory information can 
be encoded directly within the nucleic acid sequence. Direct structural
proximity is often not necessary, as regulatory elements may be
found proximal or distal to the genes they affect. In humans,
approximately 8% of nuclear DNA is composed of elements 
involved directly in regulation such as promoters, enhancers, silen- 
cers, and insulators (defined in Subheading 4.1.1; [127, 128]). 

If all genetic and regulatory information is encoded in the 
DNA sequence, why can’t any cell with a complete genome be 
used to produce a viable organism? The specificity of cells suggests 
that additional regulatory markers also exist outside of the primary 
DNA sequence. This type of regulation is epigenetic (above the 
genes) and is essential for normal development. Epigenetic infor- 
mation is derived from chemical modifications of the chromosome 
(e.g., DNA methylation or histone modification) that do not
change the primary sequence of chromosomal DNA and can be
passed from one generation to the next [129, 130]. It is only
through the collective actions of all cellular processes that gene
products contribute to biochemical pathways and participate in
the network of regulatory interactions to produce a complex organism
or phenotype.


4.1.1 Transcriptional Regulation


DNA transcription is the chemical process through which information
is transferred from DNA to RNA. The transcribed RNA may
itself carry out some biological function or may be part of an
intermediate information-carrying class of RNA known as messenger
RNA (mRNA). mRNA, along with other RNA molecules
(tRNA and rRNA), is part of the machinery used to synthesize
proteins. The flow of genetic information from DNA to RNA to
protein is present in all forms of life. However, it is important to
note that information transfer is not exclusively unidirectional. The
enzyme reverse transcriptase can transfer genetic information from
an RNA template into DNA.

The basic model of transcriptional regulation requires that
regulatory proteins called transcription factors (TFs) bind specific
DNA sequences in regulatory modules (RMs). TFs are protein
products that are themselves subject to regulation of gene
expression. RMs are defined according to both the primary DNA
sequence to which TFs bind and their role in the process of regulating
gene expression. Promoters are one type of RM. They are
specific motifs on DNA that are necessary regulatory elements for
RNA transcription in prokaryotes and eukaryotes. They bind the
basal transcriptional machinery, RNA polymerase and general TFs.
Enhancers are RMs that bind activator proteins and enhance the
affinity of RNA polymerase for the promoter region. They, therefore,
result in an upregulation of transcription of a gene or set of
genes. Enhancers often act by stabilizing RNA polymerase binding
through structural histone modifications [131]. Silencers are regulatory
elements that when bound to repressor proteins function to
prevent gene transcription. Silencers and enhancers are often
distance-independent, meaning that they can act on gene(s) that
are proximal or distal to their location [132]. Enhancers can be
thought of as on-switches for gene expression, whereas silencers are
the off-switches.


4.1.2 Translational Regulation

The fate of all mRNAs, transcribed from protein-coding genes, is 
not the same. The mRNA is often subjected to translational regu- 
lation depending on cellular and environmental conditions. These 
regulatory mechanisms affect the rate of protein synthesis. In pro- 
karyotes and eukaryotes, most translational regulation involves 
structural changes in the mRNA molecule that impact its accessi- 
bility [133, 134]. The mRNAs can be sequestered in stress granules 
or localized in specific regions of a cell’s cytoplasm 
[135-137]. Another mechanism of translational regulation is
RNA interference (RNAi). This regulation strategy is common in
eukaryotes and involves short noncoding RNAs—microRNA 
(miRNA) or small interfering RNA (siRNA)—that bind with 
imperfect complementarity to their target mRNA transcripts. The 
binding of miRNA (or siRNA) to mRNA destabilizes (or degrades) 
the target mRNA, thereby inhibiting its translation. The imperfect 
pairing allows a single RNAi molecule to affect the expression of 
multiple genes. In the human genome, almost 50% of mRNA 
transcripts are regulated by one or more miRNAs [138]. 
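
The consequence of imperfect pairing can be illustrated with a toy sketch. The Python code below is not a target-prediction method. It assumes a deliberately crude, hypothetical rule in which a small RNA "binds" wherever its reverse complement matches a stretch of mRNA with at most a chosen number of mismatches. The sequences, the gene labels, the function names, and the mismatch threshold are all invented for illustration.

```python
# Toy illustration: imperfect complementarity lets one small RNA match
# several targets. All sequences and thresholds here are hypothetical.

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def count_mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def has_binding_site(mirna: str, mrna: str, max_mismatches: int) -> bool:
    """True if some window of the mRNA pairs with the miRNA
    with at most `max_mismatches` mismatched positions."""
    site = reverse_complement(mirna)  # what a perfectly paired target carries
    width = len(site)
    return any(
        count_mismatches(site, mrna[i:i + width]) <= max_mismatches
        for i in range(len(mrna) - width + 1)
    )


mirna = "UAGCUUAU"
transcripts = {
    "gene_a": "GGGAUAAGCUACCC",  # carries a perfectly complementary site
    "gene_b": "CCAUAAGGUAGG",    # carries a site with one mismatch
    "gene_c": "GGGGCCCCGGGG",    # carries no similar site
}

perfect_only = [g for g, seq in transcripts.items()
                if has_binding_site(mirna, seq, max_mismatches=0)]
imperfect = [g for g, seq in transcripts.items()
             if has_binding_site(mirna, seq, max_mismatches=1)]
```

With perfect pairing required, only one of the three hypothetical transcripts is matched; allowing a single mismatch adds a second. Real miRNA target recognition depends on additional features that this sketch ignores.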

In prokaryotes, transcription and translation are more tightly 
coupled than in eukaryotes, and this allows prokaryotes to regulate 
their gene expression primarily by controlling the amount of tran- 
scription. Nevertheless, prokaryotes can still conduct translational 
regulation. They can employ a fundamentally different type of translational
regulatory machinery: the recently discovered CRISPR-Cas
system. Although CRISPR loci were first identified in prokaryotes
in 1987 [139], the system was only recently described as a bacterial
immune defense system [140]. The CRISPR-Cas system is most
commonly known to target external DNA (viral or plasmid) and
degrade it before it can be transcribed or translated. Recent
advances suggest that some CRISPR-Cas systems are more
general and have the capacity to target RNA molecules. This was
first discovered in Pyrococcus furiosus [141]; similar RNA targeting
was later found in Sulfolobus solfataricus [142]. Throughout these
advances, the CRISPR-Cas system was still strictly viewed as an
immune response to target and degrade external nucleic acid mole- 
cules. It was only in 2016 that a CRISPR-Cas system was discov- 
ered that targets cellular mRNAs and thereby participates in 
translational regulation [143]. 


The term epigenetics was coined in 1942 by Waddington 
[144]. He defined it as changes in an organism’s phenotype with- 
out an underlying alteration of its genome. It is now understood 
that epigenetic effects cause variation in phenotypes not associated 
with a change in the primary sequence but by chemical alterations 
of the DNA. Consider this analogy: throughout this review, when- 
ever a word was being defined it was written in this format. If this 
chapter was rewritten with all bolds and italics removed, the infor- 
mational content would be unaltered; however, the emphasis 
would be different. These “decorative” changes in font are akin to 
chemical epigenetic markers appended to the DNA. DNA methyl- 
ation is a type of chemical decoration that is analogous to striking 
through a phrase. Specifically, it corresponds to the addition of a 
methyl group to parts of the DNA that results in gene silencing 
[145]. This additional information is not directly encoded within 
the primary DNA sequence but is manifested through chemical 
changes in nucleotides [145]. Thus, DNA methylation is one 
form of epigenetic control of gene expression. Epigenetic factors
may also have an impact on regulation by changing protein-DNA
binding. In eukaryotes, epigenetic factors may bind to consecutive
histones moving them closer to each other. This results in local
DNA compaction and prevents the expression of the gene(s) in this
location.

Importantly, an organism’s exposure to certain environmental
conditions can impact the epigenetic markers on its genome.
Because epigenetic mechanisms ultimately affect the physiological
form of the chromosome, such environmental exposures can lead
to heritable changes in gene expression with no change to the
underlying DNA sequence. It was initially thought that these
alterations are not heritable and that following fertilization all
epigenetic markers are removed from the zygote genome. Accumulating
evidence suggests that such erasure of epigenetic marks
occurs for most but not all genes [129, 130].


4.2 Mobile Genetic Elements

Also known as transposons or jumping genes, mobile genetic ele- 
ments are sequences that can move around within a genome inde- 
pendently of the complex networks which otherwise regulate gene 
expression [146]. Through their movement, transposons often 
cause mutations either by inserting into a gene and disturbing its 
function or by promoting DNA rearrangement. If a transposon is 
inserted within a protein-coding region, then it will undoubtedly 
affect the expression of this gene by altering the final protein 
product. Transposons may also be inserted into regulatory regions 
resulting in over- or under-expression of certain gene(s). The capa- 
bility of these DNA sequences to produce new copies of themselves 
elsewhere in a genome is called transposition. The two types of 
transposition are: 


Copy-and-paste (replicative) transposition: a new copy of the trans- 
posable element is inserted into a new site, while the old copy 
remains integrated into the original site [147]. This type of 
transposition requires transfer of information into an RNA 
intermediate (retrotransposons) and subsequent retrotranscription
into DNA. This mechanism results in an increase
in the number of transposon copies.


Cut-and-paste (non-replicative or conservative) transposition: the 
transposable element is excised from the old site and is inserted 
into a new site in the genome. The number of transposons is 
not increased in this case [147]. 
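
The bookkeeping difference between the two modes can be sketched in a few lines of Python. This is a toy model under stated assumptions: the genome is represented as a list of labelled segments, "TE" marks a transposable element, and the function names and the example genome are hypothetical. It captures only the change in copy number and position, not any molecular mechanism.

```python
# Toy model of the two transposition modes on a genome represented as a
# list of labelled segments. "TE" marks a transposable element.


def copy_and_paste(genome: list[str], source: int, target: int) -> list[str]:
    """Replicative transposition: the element at `source` stays in place
    and a new copy is inserted before position `target`."""
    element = genome[source]
    return genome[:target] + [element] + genome[target:]


def cut_and_paste(genome: list[str], source: int, target: int) -> list[str]:
    """Conservative transposition: the element at `source` is excised and
    re-inserted before position `target` (indexed in the original genome)."""
    element = genome[source]
    without = genome[:source] + genome[source + 1:]
    # Removing the element shifts every later position left by one.
    adjusted = target - 1 if target > source else target
    return without[:adjusted] + [element] + without[adjusted:]


genome = ["geneA", "TE", "geneB", "geneC"]

replicated = copy_and_paste(genome, source=1, target=3)
moved = cut_and_paste(genome, source=1, target=3)
```

In this sketch, `replicated` contains two copies of the element and is one segment longer than the starting genome, whereas `moved` contains a single copy at a new position.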


Transposable elements are found in all cell types. The kinds of 
transposable elements vary within and between prokaryotes and 
eukaryotes. They are often viewed as genetic parasites since they 
rely on a host cell for information processing systems (replication, 
transcription, and/or translation). In humans, about 44% of the 
genome is comprised of sequences that are related to transposable
elements [148]. These mobile genetic elements had an important
impact on eukaryotic evolution [149, 150]. For example, siRNA 
regulation is believed to have evolved to regain control of the 
expression of transposable elements [151]. For a review of the 
regulatory mechanisms of transposable elements, see ref. 152. 


5 The Role of the Genome as an Informational Entity in Biology 


Although the information contained within a genome is necessary 
to maintain a living cell, it is not sufficient on its own. Expression of 
biologically useful information requires a complex network of cel- 
lular components for processing and regulation of the genome. 
This dependency on external cellular components permits consid- 
erable flexibility in how the information is stored. As we have seen, 
the information essential for eukaryotic life is partitioned between 
chromosomes located in nuclear and organelle compartments, with 
some nuclear-encoded proteins being transported to the organelle 
for assembly with other proteins synthesized within the organelle 
[110]. Thus, as long as the cellular mechanisms for expression and 
processing are in place, genomic information can be physically 
dispersed within the cell. The Cryptophytes have taken this to an 
extreme, having their genomic information distributed across four 
cellular compartments: the nucleus, nucleomorph, mitochondria, 
and plastids [153]. Clearly, the physical location of the genome is 
not a constraint to information storage and processing. Further- 
more, the storage of that information need not remain in a particu- 
lar physical location. In the case of temperate phages, genomic 
information is transferred, for a period of time, to the genome of 
its host where it is maintained by its host’s replication processes 
[154]. These examples, and others (e.g., [126]), underscore the 
importance of viewing the genome foremost as an informational 
entity irrespective of its physical location. 

In a well-argued critique of conventional notions of the 
genome, Goldman and Landweber [4] argue that viewing DNA 
as the sole source of information leads to additional difficulties. 
Recall that the NIH definition refers to the genome as containing 
all of the information needed to build and maintain that organism. 
We now understand that even the cell and its associated cytoplasm 
are not always sufficient for realization of all functional capabilities 
encoded within a genome. In other words, the genome, as conven- 
tionally defined, appears to be an incomplete informational entity 
[4]. Genome research has identified a variety of extracellular infor- 
mational entities that can influence, and in some cases are even 
essential to, the creation and maintenance of an organism. Below 
we review selected examples of this phenomenon prior to reasses- 
sing the definition of the genome in light of modern genome 
science.

Marine cyanobacteria (Prochlorococcus and Synechococcus) are
among the most abundant photosynthetic organisms in the world’s 
oceans. The viruses that infect them (cyanophages) were discovered 
to possess copies of some of their hosts’ photosynthesis genes (e.g.,
PsbA and PsbD [155, 156]). Through the process of HGT, the
cyanophages acquired host genes, which they express after infection 
to optimize their own gene expression and broaden their host range 
[157]. As novel as this discovery was, it was completely unexpected 
that the cyanobacteria and their phages continued to exchange 
genetic variation through homologous recombination 
[157]. Through such exchanges, the PsbA and PsbD genes partici- 
pate in gene pools that extend beyond the photosynthetic species 
boundaries [157]. Given that cyanobacteria contribute as much as 
30% of carbon fixation worldwide, those findings suggest that viral 
gene pool dynamics have influenced the evolution of oceanic pho- 
tosynthesis on a global scale. This case demonstrates that to fully 
understand the origin and distribution of photosynthetic diversity, 
one must be aware that relevant genetic information can reside 
outside of the genomes of the photosynthetic organisms. 

The bacterial genus Listeria is comprised of ecologically diver- 
gent lineages that share gene pools through the process of homol- 
ogous recombination [158, 159]. Listeria monocytogenes is a 
pathogen closely related to the nonpathogenic species L. innocua. 
L. monocytogenes evolved as a pathogen through the process of 
HGT [160] and then subsequently evolved into ecologically diver- 
gent lineages differing in population structure and ability to 
respond to environmental stress [161]. Among Listeria, recombi- 
nation is frequent enough to permit natural selection to act inde- 
pendently of the variability present at unlinked loci, thereby 
promoting or impeding exchangeability of genes among species 
and ecotypes residing in different niches [159]. This is just one 
example of the “mosaic genome” model of prokaryotic genome 
evolution, where the combined effects of recombination, drift, and 
selection lead to genomes comprised of a mosaic of differentially 
extendible trans-species gene pools. A wide variety of bacterial 
species are now thought to have genome dynamics consistent 
with the mosaic genome model [159, 162-165]. In some cases, 
the process of genomic divergence can even become decoupled 
from the process of ecological divergence [159, 163]. Thus, the 
physical genomes of some species of prokaryotes are incomplete 
informational entities. 

The single-celled stichotrichous ciliates Oxytricha and Stylony- 
chia have two nuclei that store genomic information in very differ- 
ent forms [166]. One nucleus, called the macronucleus, contains 
information in the form required for growth and maintenance of a 
cell. Hence, the macronuclear DNA is often referred to as “active.” 
The second nucleus, called the micronucleus, contains the same 
information in a “stored” form, which is used to produce the active
form of the DNA in the next generation. However, information
storage in the micronucleus is extremely complex. Protein-coding
genes expressed by the macronucleus are partitioned into small
segments, inverted, and scrambled among ~1 Gb of other DNA
sequences within the micronucleus. Furthermore, the production 
of a working macronucleus in the next generation cannot be 
accomplished without information contained within both small 
RNA molecules (piRNA) and long RNA templates (lncRNA),
which are passed across generations via the cytoplasm of the maternal
macronucleus [167, 168]. The piRNA are crucial to the elimination
of DNA during the development of an active macronucleus,
and the lncRNA mediate (1) unscrambling of the inactive micronuclear
DNA, (2) regulation of gene dosage in the macronucleus,
and (3) epigenetic transfer of somatic (macronuclear) alterations 
that are not found within the germ-line (micronuclear) DNA 
[167]. Thus, without those RNA molecules, the DNA genome of 
the stichotrichous ciliates is an incomplete informational entity 
[4]. Furthermore, emerging work on both Oxytricha and Stylony- 
chia suggests that epigenetic modification of their DNA may play a 
role in the production of active macronuclear DNA 
[166, 169-171].

Complex microbial communities live in close association with 
the human body and have a strong impact on human health and 
disease. Host genetic variation is known to influence the composi- 
tion of those communities [172], and, conversely, microbial varia- 
bility is thought to influence various host disease states [173]. This 
association is so intimate that the microbiome has been referred to 
as an additional “human organ” [174], and substantial amounts of 
missing heritability associated with many complex human diseases 
are now being attributed, in part, to a failure to adequately account 
for microbial genetic variation [175]. Taking inflammatory bowel 
disease (IBD) as an example, host human genetic variation accounts 
for less than 50% of its estimated heritability [176]. This result 
implies that there exists undiscovered context dependence of 
human genetic variation for IBD. We have since come to under- 
stand that there is extensive inter-individual variation in the genetic 
composition of the gut microbiome and this metagenomic varia- 
tion can influence healthy and dysregulated human immune 
responses [177] and is predictive of IBD patient outcomes 
[178]. Because the development of the IBD phenotype is related 
to gut microbiome variability, and because genetically similar 
human hosts can have different microbiomes, heritability estimates 
for human DNA variation will be impacted [175]. In other words, 
the expression of similar IBD phenotypes in humans is a function of 
both human and microbial genetics. Regardless of whether such 
interactions should be formally included within any future concep- 
tion of the genome, this example illustrates how the human
genome is also an incomplete informational entity with respect to
prediction of healthy and disease states. 

Goldman and Landweber [4] suggest that the notion of the 
genome should be reconceptualized in light of our modern, and 
deeper, understanding of genomic diversity and the mechanisms of 
information storage and processing. We agree and follow Goldman 
and Landweber [4] when they call for a “more expansive definition 
of the genome as an informational entity, often but not always 
manifest as DNA, encoding a broad set of functional capabilities 
that, together with other sources of information, produce and 
maintain the organism.” At first glance, this appears to be consis- 
tent with the controversial idea that a collection of functionally 
integrated organisms, called a holobiont, is a fundamental unit of 
biological organization and their set of genomes, called a hologen- 
ome, is itself a unit subject to evolution by natural selection 
[179]. However, we cannot go this far. We expect that any holo- 
genome composed of informational entities having even a little 
independence is analogous to intra-genomic epistasis with just a 
little recombination. In the latter case, adaptive coevolution is not 
very effective at moving the system on its fitness landscape via 
compensatory substitutions [180]. Further, when informational 
entities are largely independent, either through high recombina- 
tion (as observed in Listeria) or through independent replication 
(as within the human gut microbiome), the process of genomic 
divergence can become decoupled from ecological dynamics. Thus, 
we cannot agree with the notion of the hologenome as a unit of 
selection. Rather, we view the genome as a potential mosaic of gene 
pools subject to different evolutionary dynamics, and we follow 
Goldman and Landweber [4] by considering it foremost as an 
informational entity, which may be incomplete and which does 
not have to manifest exclusively as the DNA within a species 
boundary.


Abstract


In this chapter, we review basic concepts from probability theory and computational statistics that are 
fundamental to evolutionary genomics. We provide a very basic introduction to statistical modeling and 
discuss general principles, including maximum likelihood and Bayesian inference. Markov chains, hidden 
Markov models, and Bayesian network models are introduced in more detail as they occur frequently and in 
many variations in genomics applications. In particular, we discuss efficient inference algorithms and 
methods for learning these models from partially observed data. Several simple examples are given 
throughout the text, some of which provide the basis for models that are discussed in more detail in 
subsequent chapters. 


Key words Bayesian inference, Bayesian networks, Dynamic programming, EM algorithm, Hidden 
Markov models, Markov chains, Maximum likelihood, Statistical models 


1 Statistical Models 


Evolutionary genomics can only be approached with the help of 
statistical modeling. Stochastic fluctuations are inherent to many 
biological systems. Specifically, the evolutionary process itself is 
stochastic, with random mutations and random mating being 
major sources of variation. In general, stochastic effects play an 
increasingly important role if the number of molecules, or cells, 
or individuals of a population is small. Stochastic variation also 
arises from measurement errors. Biological data is often noisy due 
to experimental limitations, especially for high-throughput tech- 
nologies, such as microarrays or next-generation sequencing [1, 2]. 

Statistical modeling addresses the following questions: What 
can be generalized from a finite sample obtained from an experi- 
ment to the population? What can be learned about the underlying 
biological mechanisms? How certain can we be about our model 
predictions? 

In the frequentist view of statistics, the observed variability in 
the data is the result of a fixed true value being perturbed by
random variation, such as, for example, measurement noise. Probabilities
are thus interpreted as long-run expected relative frequencies.
By contrast, from a Bayesian point of view, probabilities
represent our uncertainty about the state of nature. There is no 
true value, but only the data is real. Our prior belief about an event 
is updated in light of the data. 

Statistical models represent the observed variability or uncer- 
tainty by probability distributions [3, 4]. The observed data are 
regarded as realizations of random variables. The parameters of a 
statistical model are usually the quantities of interest because they 
describe the amount and nature of systematic variation in the data. 
Parameter estimation and model selection are discussed in more 
detail in the next section. In this section, we first consider discrete, 
and then continuous random variables, and univariate
(1-dimensional) before multivariate (n-dimensional) ones. We 
start by formulating the well-known Hardy-Weinberg principle 
[5, 6] as a statistical model. 


Example 1 (Hardy-Weinberg Model): The Hardy-Weinberg model 
is a statistical model for the genotypes in a diploid population of 
infinite size. Let us assume that there are two alleles, denoted A and 
a, and hence three genotypes, denoted AA, Aa = aA, and aa. Let 
X be the random variable with state space $\mathcal{X} = \{AA, Aa, aa\}$
describing the genotype. We parametrize the probability distribution
of X by the allele frequency p of A and the allele frequency
$q = 1 - p$ of a. The Hardy-Weinberg model is defined by:


$$P(X = AA) = p^2, \tag{1}$$

$$P(X = Aa) = 2p(1 - p), \tag{2}$$

$$P(X = aa) = (1 - p)^2. \tag{3}$$

The parameter space of the model is
$\Theta = \{p \in \mathbb{R} \mid 0 \le p \le 1\} = [0, 1]$, the unit interval. We denote the
Hardy-Weinberg model by HW(p) and write X ~ HW(p) if
X follows the distribution (Eqs. 1-3).
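
As a minimal sketch, the model HW(p) can be written as a Python function that maps the parameter p to the three genotype probabilities. The function name `hardy_weinberg` and the example value p = 0.25 are our own choices for illustration and are not part of the model definition.

```python
# Genotype distribution under the Hardy-Weinberg model HW(p), Eqs. 1-3.


def hardy_weinberg(p: float) -> dict[str, float]:
    """Return P(X = genotype) for allele frequency p of allele A."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in the parameter space [0, 1]")
    q = 1.0 - p  # frequency of allele a
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


dist = hardy_weinberg(0.25)   # example allele frequency (arbitrary choice)
total = sum(dist.values())    # the three probabilities add up to one
```

Each value of p in [0, 1] yields one probability distribution over the three genotypes, so the model is a one-parameter family of distributions.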


The Hardy-Weinberg distribution P(X) is a discrete probability 
distribution (or probability mass function) with finite state space: 
We have $0 \le P(X = x) \le 1$ for all $x \in \mathcal{X}$ and $\sum_{x \in \mathcal{X}} P(X = x) =
p^2 + 2p(1 - p) + (1 - p)^2 = [p + (1 - p)]^2 = 1$. In general, any
statistical model for a discrete random variable with n states defines
a subset of the (n - 1)-dimensional probability simplex:


$$\Delta_{n-1} = \{(p_1, \ldots, p_n) \in [0, 1]^n \mid p_1 + \cdots + p_n = 1\}. \tag{4}$$


The probability simplex is the set of all possible probability distri- 
butions of X, and statistical models can be understood as specific 
subsets of the simplex [7].

The Hardy-Weinberg distribution is of interest because it arises
under the assumption of random mating. A population with major 
allele frequency p has genotype probabilities given in Eqs. 1-3 
after one round of random mating. We find that the new allele 
frequency: 


$$p' = P(AA) + P(Aa)/2 = p^2 + 2p(1 - p)/2 = p, \tag{5}$$


is equal to the one in the previous generation. Thus, genetic varia- 
tion is preserved under this simple model of sexual reproduction, 
and the population is at equilibrium after one generation. In 
other words, Eqs. 1-3 describe the set of all populations at 
Hardy-Weinberg equilibrium. The parametric representation: 


$$\{(p_{AA}, p_{Aa}, p_{aa}) \in \Delta_2 \mid p_{AA} = p^2,\ p_{Aa} = 2p(1 - p),\ p_{aa} = (1 - p)^2\}, \tag{6}$$


of this set of distributions is equivalent to the implicit representation
as the intersection of the Hardy-Weinberg curve:


$$4\, p_{AA}\, p_{aa} - p_{Aa}^2 = 0 \tag{7}$$


with the probability simplex $\Delta_2$ (Fig. 1).
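
The equilibrium argument can be checked numerically. The sketch below starts from a hypothetical population that is not on the Hardy-Weinberg curve, applies one round of random mating, and evaluates the allele frequency and the left-hand side of the curve equation before and after. The starting genotype frequencies and all function names are assumptions made for illustration.

```python
# Numerical check of the equilibrium argument for a population that
# starts away from the Hardy-Weinberg curve.


def allele_frequency(p_AA: float, p_Aa: float, p_aa: float) -> float:
    """Frequency of allele A: all of AA plus half of Aa."""
    return p_AA + p_Aa / 2.0


def random_mating(p_AA: float, p_Aa: float, p_aa: float):
    """Genotype frequencies after one round of random mating."""
    p = allele_frequency(p_AA, p_Aa, p_aa)
    return (p * p, 2.0 * p * (1.0 - p), (1.0 - p) ** 2)


def hw_curve(p_AA: float, p_Aa: float, p_aa: float) -> float:
    """4 * p_AA * p_aa - p_Aa^2, which is zero exactly on the curve."""
    return 4.0 * p_AA * p_aa - p_Aa ** 2


parents = (0.5, 0.2, 0.3)              # hypothetical starting population
offspring = random_mating(*parents)
grandchildren = random_mating(*offspring)

p_before = allele_frequency(*parents)
p_after = allele_frequency(*offspring)
```

Under these assumptions, `hw_curve(*parents)` is clearly nonzero, `hw_curve(*offspring)` vanishes up to floating-point error, `p_before` and `p_after` agree, and a second round of mating leaves the genotype frequencies unchanged.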

The simplest discrete random variable is a binary (or Bernoulli) 
random variable X. The textbook example of a Bernoulli trial is the 
flipping of a coin. The state space of this random experiment is the 
set that contains all possible outcomes, namely, whether the coin 
lands on heads (X = 0) or tails (X = 1). We write $\mathcal{X} = \{0, 1\}$ to
denote this state space. The parameter space is the set that contains 
all possible values of the model parameters. In the coin tossing 
example, the only parameter is the probability of observing tails, 
p, and this parameter can take any value between 0 and 1, so we 
write $\Theta = \{p \mid 0 \le p \le 1\}$ for the parameter space. In general, the
event X = 1 is often called a “success,” and p = P(X = 1) the
probability of success. 


Fig. 1 De Finetti diagram showing the Hardy-Weinberg curve
$4\, p_{AA}\, p_{aa} - p_{Aa}^2 = 0$ inside the probability simplex $\Delta_2 = \{(p_{AA}, p_{Aa}, p_{aa}) \mid
p_{AA} + p_{Aa} + p_{aa} = 1\}$. Each point in this space represents a population as
described by its genotype frequencies. Points on the curve correspond to
populations in Hardy-Weinberg equilibrium


Example 2 (Binomial Distribution): Consider n independent
Bernoulli trials, each with success probability p. Let X be the
random variable counting the number of successes k among the
n trials. Then, X has state space $\mathcal{X} = \{0, \ldots, n\}$ and


$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}. \tag{8}$$


This is the binomial distribution, denoted Binom(n, p). Its parameter
space is $\Theta = \mathbb{N} \times [0, 1]$. Examples of binomially distributed
random variables are the number of “heads” in successive coin 
tosses or the number of mutated genes in a group of species. 


Important characteristics of a probability distribution are its 
expectation (or expected value, or mean) and its variance. They 
are defined, respectively, as: 


$$E(X) = \sum_{x \in \mathcal{X}} x\, P(X = x), \tag{9}$$

$$\mathrm{Var}(X) = \sum_{x \in \mathcal{X}} [x - E(X)]^2\, P(X = x). \tag{10}$$

The standard deviation is $\sqrt{\mathrm{Var}(X)}$. For the binomial distribution,
X ~ Binom(n, p), we find E(X) = np and Var(X) = np(1 - p).
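The closed-form moments of the binomial distribution can be verified directly from the definitions in Eqs. 9 and 10. This is a small Python sketch under the assumption that a distribution is stored as a dictionary mapping each state to its probability; the helper names are illustrative.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binom(n, p), as in Eq. 8."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def mean_and_variance(pmf):
    """Expectation (Eq. 9) and variance (Eq. 10) of a distribution {x: P(X = x)}."""
    mean = sum(x * prob for x, prob in pmf.items())
    var = sum((x - mean) ** 2 * prob for x, prob in pmf.items())
    return mean, var


n, p = 10, 0.3
pmf = {k: binom_pmf(k, n, p) for k in range(n + 1)}
mean, var = mean_and_variance(pmf)
print(mean, n * p)            # both close to 3.0
print(var, n * p * (1 - p))   # both close to 2.1
```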


Example 3 (Poisson Distribution): The Poisson distribution Pois(λ)
with parameter λ > 0 is defined as:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k \in \mathbb{N}. \tag{11}$$

It describes the number X of independent events occurring in a
fixed period of time (or space) at average rate λ and independently
of the time since (or distance to) the last event. The Poisson
distribution has equal expectation and variance,
E(X) = Var(X) = λ.

The Poisson distribution is used frequently as a model for the
number of DNA mutations in a gene after a certain time period,
where λ is the mutation rate. Both the binomial and the Poisson
distribution describe counts of random events. In the limit of large
n and fixed product np, the two distributions coincide,
Binom(n, p) → Pois(np), for n → ∞.
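The convergence of the binomial to the Poisson distribution can be illustrated numerically. The sketch below is illustrative only: the function `max_abs_diff`, the choice λ = 2, and the cutoff `k_max` are assumptions made for the example. It keeps np = λ fixed and reports the largest pointwise gap between the two probability functions as n grows.

```python
from math import comb, exp, factorial


def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def pois_pmf(k, lam):
    return lam ** k * exp(-lam) / factorial(k)


def max_abs_diff(n, lam, k_max=20):
    """Largest gap between Binom(n, lam/n) and Pois(lam) over k = 0, ..., k_max."""
    p = lam / n
    return max(abs(binom_pmf(k, n, p) - pois_pmf(k, lam)) for k in range(k_max + 1))


for n in (20, 200, 2000):   # n must be at least k_max
    print(n, max_abs_diff(n, lam=2.0))
```

The printed gaps shrink as n increases, in line with the limit stated above.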


Example 4 (Shotgun Sequencing): Let us consider a simplified model
of the shotgun approach to DNA sequencing. Suppose that n reads
of length L have been obtained from a genome of size G. We
assume that all reads have the same probability of being sequenced.
Then, the probability of hitting a specific base with one read
is p = L/G, and the average coverage of the sequencing run is
c = np. Under this model, the number of times X a single base is
sequenced is distributed as Binom(n, p). For large n, we have

$$P(X = k) \approx \frac{c^k e^{-c}}{k!}. \tag{12}$$

For example, using next-generation sequencing technology, one
might obtain $n = 10^8$ reads of length L = 100 bases in a single
run. For the human genome of length $G = 3 \cdot 10^9$, we obtain a
coverage of c = nL/G ≈ 3.3. The distribution of the number of reads per
base pair is shown in Fig. 2. In particular, the fraction of
unsequenced positions is $P(X = 0) = e^{-c} \approx 3.57\%$.

Fig. 2 Coverage distribution of a shotgun sequencing experiment with $n = 10^8$ reads of length L = 100 of the
human genome of length $G = 3 \cdot 10^9$. The average coverage is c = np ≈ 3.3, where p = L/G. Dots show
the binomial coverage distribution Binom(n, p) and the solid line its approximation by the Poisson distribution
Pois(np). Note that the Poisson distribution is also discrete and just shown as a line to distinguish it from the
binomial distribution (horizontal axis: reads per base; vertical axis: probability)
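The numbers of this example follow from a few lines of arithmetic. The Python sketch below uses only the quantities given in the example; the variable names are illustrative.

```python
from math import exp

n = 10 ** 8        # number of reads
L = 100            # read length in bases
G = 3 * 10 ** 9    # genome length in bases

p = L / G          # probability that one read covers a given base
c = n * p          # average coverage

# Under the Poisson approximation (Eq. 12), P(X = 0) = exp(-c).
unsequenced = exp(-c)
print(f"coverage c = {c:.2f}")                        # coverage c = 3.33
print(f"unsequenced fraction = {unsequenced:.2%}")    # unsequenced fraction = 3.57%
```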


A continuous random variable X takes values in $\mathcal{X} = \mathbb{R}$ and is
defined by a nonnegative function f(x) such that:

$$P(X \in B) = \int_B f(x)\, dx \quad \text{for all subsets } B \subseteq \mathbb{R}. \tag{13}$$

The function f is called the probability density function of X. For an
interval:

$$P(X \in [a, b]) = P(a \le X \le b) = \int_a^b f(x)\, dx. \tag{14}$$

The cumulative distribution function is

$$F(b) = P(X \le b) = \int_{-\infty}^{b} f(x)\, dx, \quad b \in \mathbb{R}. \tag{15}$$

Thus, the density is the derivative of the cumulative distribution
function, $\frac{d}{dx} F(x) = f(x)$.

In analogy to the discrete case, expectation and variance of a
continuous random variable are defined, respectively, as:

$$E(X) = \int_{-\infty}^{\infty} x\, f(x)\, dx, \tag{16}$$

$$\mathrm{Var}(X) = \int_{-\infty}^{\infty} [x - E(X)]^2 f(x)\, dx. \tag{17}$$


Example 5 (Normal Distribution): The normal (or Gaussian)
distribution has the density function:

$$f(x) = (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right). \tag{18}$$

The parameter space is $\Theta = \{(\mu, \sigma^2) \mid \mu \in \mathbb{R},\ \sigma^2 \in \mathbb{R}_+\}$. A normal
random variable $X \sim \mathrm{Norm}(\mu, \sigma^2)$ has mean $E(X) = \mu$ and
variance $\mathrm{Var}(X) = \sigma^2$. Norm(0, 1) is called the standard normal
distribution.


The normal distribution is frequently used as a model for
measurement noise. For example, $X \sim \mathrm{Norm}(\mu, \sigma^2)$ might describe
the hybridization intensity of a sample to a probe on a microarray.
Then, μ is the level of expression of the corresponding gene and
σ² summarizes the experimental noise associated with the microarray
experiment. The parameters can be estimated from a finite
sample $(x^{(1)}, \ldots, x^{(N)})$, i.e., from N replicate experiments, as the
empirical mean and variance, respectively:

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}, \tag{19}$$

$$s^2 = \frac{1}{N - 1} \sum_{i=1}^{N} \left(x^{(i)} - \bar{x}\right)^2. \tag{20}$$

The normal distribution plays a special role in statistics due to
the central limit theorem. It asserts that the average
$\bar{X}_N = (X^{(1)} + \cdots + X^{(N)})/N$ of N independent (see below) and
identically distributed (i.i.d.) random variables $X^{(i)}$ with equal
mean μ and variance σ², once centered and scaled, converges in
distribution to the standard normal distribution:

$$\sqrt{N}\, \frac{\bar{X}_N - \mu}{\sigma} \xrightarrow{d} \mathrm{Norm}(0, 1), \tag{21}$$

irrespective of the shape of their distribution. As a consequence,
many test statistics and estimators are asymptotically normally
distributed. For example, the Poisson distribution Pois(λ) is
approximately normal Norm(λ, λ) for large values of λ.

We often measure multiple quantities at the same time, for
example the expression of several genes, and are interested in
correlations among the variables. Let X and Y be two random
variables with expected values $\mu_X$ and $\mu_Y$ and variances $\sigma_X^2$ and $\sigma_Y^2$,
respectively. The covariance between X and Y is

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y] \tag{22}$$

and the correlation between X and Y is $\rho_{X,Y} = \mathrm{Cov}(X, Y)/(\sigma_X \sigma_Y)$.
For observations $(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})$,
the sample correlation coefficient is

$$r_{x,y} = \frac{\sum_{i=1}^{N} \left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right)}{(N - 1)\, s_x s_y}, \tag{23}$$

where $s_x$ and $s_y$ are the sample standard deviations of X and Y,
respectively, defined in Eq. 20.
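The empirical mean, variance, and correlation of Eqs. 19, 20, and 23 translate directly into code. The following Python sketch is illustrative; the two lists of expression values are hypothetical data invented for the example.

```python
from math import sqrt


def sample_mean(xs):
    return sum(xs) / len(xs)                                  # Eq. 19


def sample_variance(xs):
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)      # Eq. 20


def sample_correlation(xs, ys):
    mx, my = sample_mean(xs), sample_mean(ys)
    sx, sy = sqrt(sample_variance(xs)), sqrt(sample_variance(ys))
    cross = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cross / ((len(xs) - 1) * sx * sy)                  # Eq. 23


# Hypothetical expression values of two genes across five samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.1, 3.9, 6.2, 8.0, 9.8]
print(sample_correlation(gene_a, gene_b))
```

For exactly linearly related data, the coefficient equals 1 or -1 depending on the sign of the slope.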

So far, we have worked with univariate distributions and we 
now turn to multivariate distributions, i.e., we consider random 
vectors $X = (X_1, \ldots, X_n)$ such that each $X_i$ is a random variable.
For the case of discrete random variables $X_i$, we first generalize the
binomial distribution to random experiments with a finite number
of outcomes.

Example 6 (Multinomial Distribution): Let K be the number of
possible outcomes of a random experiment and $\theta_k$ the probability of
outcome k. We consider the random vector $X = (X_1, \ldots, X_K)$ with
values in $\mathcal{X} = \mathbb{N}^K$, where $X_k$ counts the number of outcomes of type k.
The multinomial distribution $\mathrm{Mult}(n, \theta_1, \ldots, \theta_K)$ is defined as:

$$P(X = x) = \frac{n!}{x_1! \cdots x_K!}\, \theta_1^{x_1} \cdots \theta_K^{x_K} \tag{24}$$

if $\sum_{k=1}^{K} x_k = n$, and 0 otherwise. The parameter space of the model
is $\Theta = \mathbb{N} \times \Delta_{K-1}$. For K = 2, we recover the binomial distribution
(Eq. 8). Each component $X_k$ of a multinomial vector has expected
value $E(X_k) = n\theta_k$ and variance $\mathrm{Var}(X_k) = n\theta_k(1 - \theta_k)$. The covariance of
two components is $\mathrm{Cov}(X_k, X_l) = -n\theta_k\theta_l$, for $k \ne l$.
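The multinomial probability function of Eq. 24 and its reduction to the binomial case can be sketched in Python as follows. The function name `multinomial_pmf` and the parameter values are illustrative assumptions.

```python
from itertools import product
from math import comb, factorial, prod


def multinomial_pmf(x, theta):
    """P(X = x) for X ~ Mult(n, theta) with n = sum(x), as in Eq. 24."""
    coef = factorial(sum(x))
    for x_k in x:
        coef //= factorial(x_k)          # exact: every intermediate quotient is an integer
    return coef * prod(t ** x_k for t, x_k in zip(theta, x))


# K = 2 recovers the binomial distribution (Eq. 8).
n, k, p = 10, 7, 0.3
print(multinomial_pmf((k, n - k), (p, 1 - p)))
print(comb(n, k) * p ** k * (1 - p) ** (n - k))

# The probabilities of all count vectors with sum n = 4 add up to one.
theta = (0.5, 0.3, 0.2)
total = sum(
    multinomial_pmf(x, theta)
    for x in product(range(5), repeat=3)
    if sum(x) == 4
)
print(total)
```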


In general, the covariance matrix Σ of a random vector X is
defined by:

$$\Sigma_{ij} = \mathrm{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)], \tag{25}$$

where $\mu_i$ is the expected value of $X_i$. The matrix Σ is also called the
variance-covariance matrix because the diagonal terms are the variances
$\Sigma_{ii} = \mathrm{Cov}(X_i, X_i) = \mathrm{Var}(X_i)$.

A continuous multivariate random variable X takes values in
$\mathcal{X} = \mathbb{R}^n$. It is defined by its cumulative distribution function:

$$F(x) = P(X \le x), \quad x \in \mathbb{R}^n, \tag{26}$$

where the inequality is understood component-wise,
or, equivalently, by the probability density function:

$$f(x) = \frac{\partial^n}{\partial x_1 \cdots \partial x_n} F(x_1, \ldots, x_n), \quad x \in \mathbb{R}^n. \tag{27}$$

Example 7 (Multivariate Normal Distribution): For n ≥ 1 and
$x \in \mathbb{R}^n$, the multivariate normal (or Gaussian) distribution has
density:

$$f(x) = (2\pi)^{-n/2} \det(\Sigma)^{-1/2} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right), \tag{28}$$

with parameter space $\Theta = \{(\mu, \Sigma) \mid \mu = (\mu_1, \ldots, \mu_n) \in \mathbb{R}^n \text{ and } \Sigma = (\Sigma_{ij}) \in \mathbb{R}^{n \times n}\}$,
where Σ is the symmetric, positive-definite
covariance matrix and μ the expectation. We write
$X = (X_1, \ldots, X_n) \sim \mathrm{Norm}(\mu, \Sigma)$ for a random vector with such a
distribution.


We say that two random variables X and Y are independent if
P(X, Y) = P(X)P(Y) or, equivalently, if the conditional probability
P(X | Y) = P(X, Y)/P(Y) is equal to the unconditional probability
P(X). If X and Y are independent, denoted $X \perp Y$, then
E[XY] = E[X]E[Y] and Var(X + Y) = Var(X) + Var(Y). It
follows that independent random variables have covariance zero.
However, the converse is only true in specific situations, for example
if (X, Y) is multivariate normal, but not in general because
correlation captures only linear dependencies.

This limitation can be addressed by using statistical models 
which allow for a richer dependency structure. Subheading 7 is 
devoted to Bayesian networks, a family of probabilistic graphical 
models based on conditional independences. Let X, Y, and Z be 
three random vectors. Generalizing the notion of statistical independence,
we say that X is conditionally independent of Y given
Z and write $X \perp Y \mid Z$ if P(X, Y | Z) = P(X | Z)P(Y | Z). Bayes’
theorem states that

$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}, \tag{29}$$

where P(Y) is called the prior probability and P(Y | X) the posterior
probability. Intuitively, the prior P(Y) encodes our a priori
knowledge about Y (i.e., before observing X), and P(Y | X) is our
updated knowledge about Y a posteriori (i.e., after observing X).

We have $P(X) = \sum_Y P(X, Y)$ if Y is discrete, and similarly
$P(X) = \int_Y P(X, Y)\, dY$ if Y is continuous. Here, P(X) is called
the marginal and P(X, Y) the joint probability. This summation
or integration is known as marginalization (Fig. 3).

Since $P(X) = \sum_Y P(X, Y) = \sum_Y P(X \mid Y) P(Y)$, Bayes’ theorem
can also be rewritten as:

$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{\sum_{y \in \mathcal{Y}} P(X \mid y)\, P(y)}, \tag{30}$$

where P(y) = P(Y = y) and $\mathcal{Y}$ is the state space of Y.

Fig. 3 Marginalization. Left: two-dimensional histogram of a discrete bivariate distribution with the two
marginal histograms. Right: contour plot of a two-dimensional Gaussian density with the marginal distributions
of each component


Example 8 (Diagnostic Test): We want to evaluate a diagnostic test
for a rare genetic disease. The binary random variables D and
T indicate disease status (D = 1, diseased) and test result (T = 1,
positive), respectively. Let us assume that the prevalence of the
disease is 0.5%, i.e., 0.5% of all people in the population are
known to be affected. The test has a false positive rate (probability
that somebody is tested positive who does not have the disease) of
P(T = 1 | D = 0) = 5% and a true positive rate (probability that
somebody is tested positive who has the disease) of
P(T = 1 | D = 1) = 90%. Then, the posterior probability of a person having the
disease given that he or she tested positive is

$$P(D = 1 \mid T = 1) = \frac{P(T = 1 \mid D = 1)\, P(D = 1)}{P(T = 1 \mid D = 0)\, P(D = 0) + P(T = 1 \mid D = 1)\, P(D = 1)} = 0.083, \tag{31}$$

that is, only 8.3% of the positively tested individuals actually have
the disease. Thus, our prior belief of the disease status, P(D), has
been modified in light of the test result by multiplication with
P(T | D) to obtain the updated belief P(D | T).
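The posterior probability of this example can be recomputed with a few lines of Python. The function name `posterior` and its argument names are illustrative; the numbers are those given in the example.

```python
def posterior(prior, likelihood_pos, likelihood_neg):
    """P(D = 1 | T = 1) from Bayes' theorem (Eq. 30) for a binary variable D."""
    numerator = likelihood_pos * prior
    evidence = numerator + likelihood_neg * (1 - prior)   # P(T = 1), by marginalization
    return numerator / evidence


prevalence = 0.005        # P(D = 1)
true_positive = 0.90      # P(T = 1 | D = 1)
false_positive = 0.05     # P(T = 1 | D = 0)

print(round(posterior(prevalence, true_positive, false_positive), 3))  # 0.083
```

Varying `prevalence` shows how strongly the posterior depends on the prior when the disease is rare.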


Exercise 9 (Conditional Independence): Let X, Y, and Z be random
variables. Using the laws of probability, show that X and Y are
conditionally independent given Z (i.e., $X \perp Y \mid Z$) if and only if
P(X | Y, Z) = P(X | Z).

2 Statistical Inference


Statistical models have parameters and a common task is to estimate 
the model parameters from observed data. The goal is to find the 
set of parameters with the best model fit. There are two major 
approaches to parameter estimation: maximum likelihood 
(ML) and Bayes. 

The maximum likelihood approach is based on the likelihood 
function. Let us consider a fixed statistical model M with parameter
space Θ and assume that we have observed realizations
$\mathcal{D} = (x^{(1)}, \ldots, x^{(N)})$ of the discrete random variable $X \sim M(\theta_0)$
for some unknown parameter $\theta_0 \in \Theta$. For the fixed data set $\mathcal{D}$, the
likelihood function of the model is

$$L(\theta) = P(\mathcal{D} \mid \theta), \tag{32}$$

where we write $P(\mathcal{D} \mid \theta)$ to emphasize that, here, the probability of
the data depends on the model parameter θ. For continuous random
variables, the likelihood function is defined similarly in terms
of the density function, $L(\theta) = f(\mathcal{D} \mid \theta)$. Maximum likelihood
estimation seeks the parameter $\theta \in \Theta$ for which L(θ) is maximal.
Rather than L(θ), it is often more convenient to maximize
$\ell(\theta) = \log L(\theta)$, the log-likelihood function. If the data are i.i.d.,
then:

$$\ell(\theta) = \sum_{i=1}^{N} \log P(X = x^{(i)} \mid \theta). \tag{33}$$


Example 10 (Likelihood Function of the Binomial Model): Suppose 
we have observed k = 7 successes in a total of N = 10 Bernoulli 
trials. The likelihood function of the binomial model (Eq. 8) is 


$$L(p) = p^k (1 - p)^{N - k}, \tag{34}$$

where p is the success probability and the constant binomial
coefficient of Eq. 8 has been dropped (Fig. 4). To maximize L, we
consider the log-likelihood function:

$$\ell(p) = \log L(p) = k \log p + (N - k) \log(1 - p) \tag{35}$$

and the likelihood equation dℓ/dp = 0. The ML estimate (MLE) is
the solution $\hat{p}_{ML} = k/N = 7/10$. Thus, the MLE of the success
probability is just the relative frequency of successes.
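The closed-form MLE can be cross-checked by maximizing the log-likelihood of Eq. 35 numerically. The Python sketch below uses a simple grid search, which is an illustrative choice and not a method prescribed by the text.

```python
from math import log


def log_likelihood(p, k, N):
    """Binomial log-likelihood of Eq. 35 (constant binomial coefficient omitted)."""
    return k * log(p) + (N - k) * log(1 - p)


k, N = 7, 10
grid = [i / 1000 for i in range(1, 1000)]       # points in the open interval (0, 1)
p_grid = max(grid, key=lambda p: log_likelihood(p, k, N))
print(p_grid, k / N)   # grid maximizer and closed-form MLE
```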


Example 11 (Likelihood Function of the Hardy-Weinberg Model): If
we genotype a finite random sample of a population of diploid
individuals at a single locus, then the resulting data consists of the
numbers of individuals $n_{AA}$, $n_{Aa}$, and $n_{aa}$ with the respective genotypes.
Assuming Hardy-Weinberg equilibrium (Eqs. 1-3), we want
to estimate the allele frequencies p and q = 1 - p of the
population. The likelihood function of the Hardy-Weinberg model
is $L(p) = P(AA)^{n_{AA}} P(Aa)^{n_{Aa}} P(aa)^{n_{aa}}$ and the log-likelihood is

$$\begin{aligned}
\ell(p) &= n_{AA} \log p^2 + n_{Aa} \log[2p(1 - p)] + n_{aa} \log(1 - p)^2 \\
&= (2n_{AA} + n_{Aa}) \log p + (n_{Aa} + 2n_{aa}) \log(1 - p) + n_{Aa} \log 2,
\end{aligned} \tag{36}$$

where the constant $n_{Aa} \log 2$ does not depend on p and can be dropped. The MLE of
$p \in [0, 1]$ can be found by maximizing ℓ. Solving the likelihood
equation:

$$\frac{\partial \ell}{\partial p} = \frac{2n_{AA} + n_{Aa}}{p} - \frac{n_{Aa} + 2n_{aa}}{1 - p} = 0 \tag{37}$$

yields the MLE $\hat{p}_{ML} = (2n_{AA} + n_{Aa})/(2N)$, where
$N = n_{AA} + n_{Aa} + n_{aa}$ is the total sample size. For example, if we sample
N = 100 genotypes with $n_{AA} = 81$, $n_{Aa} = 18$, and $n_{aa} = 1$, then
we find $\hat{p}_{ML} = (2 \cdot 81 + 18)/(2 \cdot 100) = 0.9$ for the frequency of
the major allele.

Fig. 4 Likelihood function of the binomial model. The underlying data set
consists of k = 7 successes out of N = 10 Bernoulli trials. The likelihood
$L(p) = p^k (1 - p)^{N - k}$ is plotted as a function of the model parameter p, the
probability of success (solid line). The MLE is the maximum of this function,
$\hat{p}_{ML} = k/N = 7/10$ (dashed line)
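The allele frequency estimate of this example, and the fact that it solves the likelihood equation, can be checked with a short Python sketch. The function names `hw_mle` and `score` are illustrative.

```python
def hw_mle(n_AA, n_Aa, n_aa):
    """MLE of the frequency p of allele A under Hardy-Weinberg equilibrium."""
    N = n_AA + n_Aa + n_aa
    return (2 * n_AA + n_Aa) / (2 * N)


def score(p, n_AA, n_Aa, n_aa):
    """Derivative of the log-likelihood with respect to p (left-hand side of Eq. 37)."""
    return (2 * n_AA + n_Aa) / p - (n_Aa + 2 * n_aa) / (1 - p)


counts = (81, 18, 1)
p_hat = hw_mle(*counts)
print(p_hat)                    # 0.9
print(score(p_hat, *counts))    # numerically zero at the MLE
```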


MLEs have many desirable properties. Asymptotically, as the 
sample size N → ∞, they are normally distributed, unbiased, and
have minimal variance. The uncertainty in parameter estimation
associated with the sampling variance of the finite data set can be
quantified in confidence intervals. There are several ways to
construct confidence intervals and statistical tests for MLEs based
on the asymptotic behavior of the log-likelihood function
$\ell(\theta) = \log L(\theta)$ and its derivatives. For example, the asymptotic
normal distribution of the MLE is

$$\hat{\theta}_{ML} \overset{a}{\sim} \mathrm{Norm}\left(\theta, J(\theta)^{-1}\right), \tag{38}$$

where $I(\theta) = -\partial^2 \ell / \partial \theta^2$ is the Fisher information and $J(\theta) = E[I(\theta)]$
the expected Fisher information. This result gives rise to the
Wald confidence intervals:

$$\left[\hat{\theta}_{ML} \pm z_{1-\alpha/2}\, \mathrm{se}(\hat{\theta}_{ML})\right], \tag{39}$$

where $z_{1-\alpha/2} = \inf\{x \in \mathbb{R} \mid 1 - \alpha/2 \le F(x)\}$ is the (1 - α/2)
quantile and F the cumulative distribution function of the standard
normal distribution. Equation 38 still holds after replacing J(θ)
with $I(\hat{\theta}_{ML})$ or $J(\hat{\theta}_{ML})$, which gives the standard error
$\mathrm{se}(\hat{\theta}_{ML}) = [I(\hat{\theta}_{ML})]^{-1/2}$ or $[J(\hat{\theta}_{ML})]^{-1/2}$, and it
also generalizes to higher dimensions. Other common constructions
of confidence intervals include those based on the asymptotic
distribution of the score function $S(\theta) = \partial \ell / \partial \theta$ and the
log-likelihood ratio $\log(L(\hat{\theta}_{ML})/L(\theta))$ [8].
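As an illustration of the Wald construction, the following Python sketch computes a 95% interval for the binomial success probability of Example 10. It assumes the observed Fisher information of the binomial log-likelihood (Eq. 35) evaluated at the MLE, which equals N / (p̂(1 - p̂)); the function name `wald_interval` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(k, N, alpha=0.05):
    """Wald confidence interval for the binomial success probability (0 < k < N)."""
    p_hat = k / N
    # Observed Fisher information of Eq. 35 at the MLE: I(p_hat) = N / (p_hat (1 - p_hat)).
    fisher = N / (p_hat * (1 - p_hat))
    se = 1 / sqrt(fisher)
    z = NormalDist().inv_cdf(1 - alpha / 2)    # (1 - alpha/2) quantile of Norm(0, 1)
    return p_hat - z * se, p_hat + z * se


print(wald_interval(7, 10))   # roughly (0.416, 0.984)
```

With only N = 10 trials the interval is wide, and for small samples a Wald interval can extend beyond [0, 1], which is one reason the alternative constructions mentioned above are used.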

We now discuss another more generic approach to quantify 
parameter uncertainty, not restricted to ML estimation, which is 
applied frequently in practice due to its simple implementation. 
Bootstrapping [9] is a resampling method in which independent 
observations are resampled from the data with replacement. The 
resulting new data set consists of (some of) the original observa- 
tions, and under i.i.d. assumptions, the bootstrap replicates have 
asymptotically the same distribution as the data. Intuitively, by
sampling with replacement, one pretends that the collection
of replicates thus obtained is a good proxy for the distribution of
data sets that one would have obtained, had one been able to actually
replicate the experiment. In this way, the variability of an estimator
(or more generally the distribution of any test statistic) can be
approximated by evaluating the estimator (or the statistic) on a
collection of bootstrap replicates. For example, the distribution of
the ML estimator of a model parameter θ can be approximated from the
bootstrap samples.


Example 12 (Bootstrap Confidence Interval for the ML Allele Frequency):
We use bootstrapping to estimate the distribution of the
ML estimator $\hat{p}_{ML}$ of the Hardy-Weinberg model for the data set
$(n_{AA}, n_{Aa}, n_{aa}) = (81, 18, 1)$ of Example 11. For each bootstrap
sample, we draw N = 100 genotypes with replacement from the
original data to obtain random integer vectors of length three
summing to 100. The ML estimate is computed for each of a
total of B bootstrap samples. The resulting distributions of $\hat{p}_{ML}$
are shown in Fig. 5, for B = 100, 1000, and 10,000. The means of
these empirical distributions are 0.899, 0.9004, and 0.9001,
respectively, and 95% bootstrap confidence intervals can be derived
from the 2.5 and 97.5% quantiles of the distributions. For
B = 100, 1000, and 10,000, we obtain, respectively, [0.8598,
0.9350], [0.860, 0.940], and [0.855, 0.940]. The basic bootstrap
confidence intervals have several limitations, including bias of the
bootstrap estimator and skewness of the bootstrap distribution.
Other methods exist for constructing confidence intervals from
the bootstrap distribution to address some of them [9].

Fig. 5 Bootstrap analysis of the ML allele frequency. The bootstrap distribution of the maximum likelihood
estimator $\hat{p}_{ML} = (2n_{AA} + n_{Aa})/(2N)$ of the major allele frequency in the Hardy-Weinberg model is plotted for
B = 100 (left), B = 1000 (center), and B = 10,000 (right) bootstrap samples, for the data set
$(n_{AA}, n_{Aa}, n_{aa}) = (81, 18, 1)$
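The bootstrap procedure of this example can be sketched in Python as follows. This is an illustrative implementation, not the code used to produce Fig. 5: the seed, the helper names, and the simple percentile rule for the interval are assumptions, so the resulting numbers will differ slightly from those reported above.

```python
import random


def hw_mle(counts):
    n_AA, n_Aa, n_aa = counts
    return (2 * n_AA + n_Aa) / (2 * sum(counts))


def bootstrap_mle(counts, B, rng):
    """Return B sorted bootstrap replicates of the Hardy-Weinberg MLE."""
    genotypes = ["AA"] * counts[0] + ["Aa"] * counts[1] + ["aa"] * counts[2]
    estimates = []
    for _ in range(B):
        sample = rng.choices(genotypes, k=len(genotypes))   # draw with replacement
        resampled = (sample.count("AA"), sample.count("Aa"), sample.count("aa"))
        estimates.append(hw_mle(resampled))
    return sorted(estimates)


def percentile_interval(sorted_values, level=0.95):
    """Simple percentile interval from sorted bootstrap replicates."""
    B = len(sorted_values)
    lo = sorted_values[int((1 - level) / 2 * B)]
    hi = sorted_values[min(B - 1, int((1 + level) / 2 * B))]
    return lo, hi


rng = random.Random(1)            # fixed seed, chosen arbitrarily for reproducibility
estimates = bootstrap_mle((81, 18, 1), B=1000, rng=rng)
print(sum(estimates) / len(estimates), percentile_interval(estimates))
```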


The Bayesian approach takes a different point of view and 
regards the model parameters as random variables [10]. Inference 
is then concerned with estimating the joint distribution of the 
parameters θ given the observed data $\mathcal{D}$. By Bayes’ theorem
(Eq. 30), we have

$$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})} = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{\int_{\theta \in \Theta} P(\mathcal{D} \mid \theta)\, P(\theta)\, d\theta}, \tag{40}$$

that is, the posterior probability of the parameters is proportional to
the likelihood of the data times the prior probability of the parameters.
It follows that, for a uniform prior, the mode of the posterior
is equal to the MLE.

From the posterior, credible intervals of parameter estimates 
can be derived such that the parameter lies in the interval with a 
certain probability, say 95%. This is in contrast to a 95% confidence 
interval in the frequentist approach because, there, the parameter is 
fixed and the interval boundaries are random variables. The mean- 
ing of a confidence interval is that 95% of similar intervals would 
contain the true parameter, if intervals were constructed indepen- 
dently from additional identically distributed data. 

The prior P(θ) encodes our a priori belief in θ before observing
the data. It can be used to incorporate domain-specific knowledge
into the model, but it may also be uninformative or objective, in
which case all parameter values are equally likely, or nearly so, a priori.
However, it can sometimes be difficult to find noninformative 
priors. In practice, conjugate priors are most often used. A conju- 
gate prior is one that is invariant with respect to the distribution 
family under multiplication with the likelihood, i.e., the posterior 
belongs to the same family as the prior. Conjugate priors are 
mathematically convenient and computationally efficient because 
the posterior can be calculated analytically for a wide range of 
statistical models. 


Example 13 (Dirichlet Prior): Let $T = (T_1, \ldots, T_K)$ be a continuous
random variable with state space $\Delta_{K-1}$. The Dirichlet distribution
Dir(α) with parameters $\alpha \in \mathbb{R}_+^K$ has probability density function:

$$f(\theta_1, \ldots, \theta_K) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}, \tag{41}$$

where Γ is the gamma function. The Dirichlet prior is conjugate to
the multinomial likelihood: If T ~ Dir(α) and
$(X \mid T = \theta) \sim \mathrm{Mult}(n, \theta_1, \ldots, \theta_K)$, then $(\theta \mid X = x) \sim \mathrm{Dir}(\alpha + x)$. For K = 2,
this distribution is called the beta distribution. Hence, the beta
distribution is the conjugate prior to the binomial likelihood.


Example 14 (Posterior Probability of Genotype Frequencies): Let us 
consider the simple genetic system with two loci and two alleles 
each of Example 1, but without assuming the Hardy-Weinberg 
model. We regard the observed genotype frequencies
$(n_{AA}, n_{Aa}, n_{aa}) = (81, 18, 1)$ as the result of a draw from a multinomial
distribution $\mathrm{Mult}(n, \theta_{AA}, \theta_{Aa}, \theta_{aa})$. Assuming a Dirichlet prior
$\mathrm{Dir}(\alpha_{AA}, \alpha_{Aa}, \alpha_{aa})$, the posterior genotype probabilities follow the
Dirichlet distribution $\mathrm{Dir}(\alpha_{AA} + n_{AA}, \alpha_{Aa} + n_{Aa}, \alpha_{aa} + n_{aa})$. In
Fig. 6, the prior Dir(10, 10, 10) is shown on the left, the multinomial
likelihood $P((n_{AA}, n_{Aa}, n_{aa}) = (81, 18, 1) \mid \theta_{AA}, \theta_{Aa}, \theta_{aa})$ in the
center, and the resulting posterior Dir(10 + 81, 10 + 18, 10 + 1)
on the right. Note that the MLE is different from the mode of the
posterior. As compared to the likelihood, the nonuniform prior has
shifted the maximum of the posterior toward the center of the
probability simplex.
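The conjugate update and the shift of the posterior mode can be reproduced with a short Python sketch. It uses the standard formula for the mode of a Dirichlet distribution, (α_k - 1) / (Σ_j α_j - K), which is valid when all α_k > 1; the function names are illustrative.

```python
def dirichlet_posterior(alpha, counts):
    """Conjugate update: Dir(alpha) prior and multinomial counts give Dir(alpha + counts)."""
    return tuple(a + n for a, n in zip(alpha, counts))


def dirichlet_mode(alpha):
    """Mode of Dir(alpha), valid when every alpha_k > 1."""
    total = sum(alpha) - len(alpha)
    return tuple((a - 1) / total for a in alpha)


counts = (81, 18, 1)                 # (n_AA, n_Aa, n_aa)
prior = (10, 10, 10)
post = dirichlet_posterior(prior, counts)
mle = tuple(n / sum(counts) for n in counts)

print(post)                          # (91, 28, 11)
print(dirichlet_mode(post))          # pulled toward the center (1/3, 1/3, 1/3)
print(mle)                           # (0.81, 0.18, 0.01)
```

With a uniform prior Dir(1, 1, 1), the posterior is Dir(82, 19, 2) and its mode coincides with the MLE.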


We often have two or more competing models and would like 
to assess which one describes best the given data. For example, we 
may have observed genotypes from the set {AA, Aa, aa} and want to 
test whether the Hardy-Weinberg model (Example 1) is a more 
appropriate description of the genotype data than the multinomial 
model of the previous Example 14. Intuitively, we might want to 
select the model that fits the data best, for example, by comparing
their likelihoods. However, the Hardy-Weinberg model has only
one parameter, namely the allele frequency p, whereas the multinomial
model has three parameters subject to the constraint
$\theta_{AA} + \theta_{Aa} + \theta_{aa} = 1$. Hence, the number of free parameters is one and
two, respectively, for the two models. This difference in the complexity
of the models makes a comparison based only on the goodness
of fit invalid, because models with more parameters, i.e.,
higher complexity, can generally provide a better fit. Estimating
model complexity and scoring models based on both model complexity
and goodness of fit is therefore essential for model comparison
and model selection.

Fig. 6 Dirichlet prior for multinomial likelihood. The Dirichlet prior is conjugate to the multinomial likelihood.
Shown are contour lines of the prior Dir(10, 10, 10) on the left, the multinomial likelihood
$P((n_{AA}, n_{Aa}, n_{aa}) = (81, 18, 1) \mid \theta_{AA}, \theta_{Aa}, \theta_{aa})$ in the center, and the resulting posterior Dir(91, 28, 11) on the right. The
posterior is the product of prior and likelihood

The goal of model selection is to find the model that best 
generalizes to unseen data, rather than just fits the observed data, 
because we seek the model capable of the most accurate predic- 
tions. A model that fits well but generalizes poorly is said to overfit 
the data. Models that are too complex tend to overfit the data. 
Model selection can be regarded as finding the right level of model
complexity for the given data, such that the predictive performance 
is optimized. This involves defining a criterion of optimality and a 
procedure for finding the optimal model. 

A common frequentist approach to model selection is the likelihood
ratio test. For a data set $\mathcal{D}$, we compare a null model, $M_0$, to an
alternative model, $M_1$, at given point estimates $\hat{\theta}_0$ and $\hat{\theta}_1$ using the ratio of
their likelihoods:

$$\Lambda(\mathcal{D}) = \frac{L(\hat{\theta}_0 \mid M_0)}{L(\hat{\theta}_1 \mid M_1)}. \tag{42}$$

If $\Lambda(\mathcal{D}) < c$, for a defined threshold c, we reject the null model and
favor the alternative model. The choice of c should be informed by
the distribution of Λ under the null. If the two models are nested,
i.e., if $M_0$ can be obtained from $M_1$ by specifying a subset of the
parameters, then -2 log Λ is approximately χ²-distributed with
degrees of freedom equal to the difference in the number of free
parameters between $M_1$ and $M_0$.

In the Bayesian framework, it is natural to compare the posterior
probabilities of the two models. By Bayes’ theorem, we have, for
i = 0, 1:

$$P(M_i \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid M_i)\, P(M_i)}{P(\mathcal{D})}, \tag{43}$$

where:

$$P(\mathcal{D} \mid M_i) = \int P(\mathcal{D} \mid \theta_i, M_i)\, P(\theta_i \mid M_i)\, d\theta_i \tag{44}$$

is the marginal likelihood. The marginal likelihood accounts for
model complexity and for uncertainty in parameter estimates, but is
usually analytically intractable and costly to compute. Various
approximations of the marginal likelihood exist that give rise to
model selection scores, such as the Bayesian information criterion
(BIC; see Subheading 7) and the Akaike information criterion
(AIC) [11].

For Bayesian model comparison, we consider the posterior
odds:

$$\frac{P(M_0 \mid \mathcal{D})}{P(M_1 \mid \mathcal{D})} = \frac{P(\mathcal{D} \mid M_0)}{P(\mathcal{D} \mid M_1)} \cdot \frac{P(M_0)}{P(M_1)}. \tag{45}$$

The ratio of the marginal likelihoods, i.e., the first factor on the
right-hand side of Eq. 45, is called the Bayes factor. With equal
priors, a Bayes factor larger than 20 is often considered strong
support for $M_0$ over $M_1$ [12].
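The likelihood ratio test described above can be sketched for the comparison of the Hardy-Weinberg model (null, one free parameter) with the full multinomial model (alternative, two free parameters). The Python code below is an illustration under stated assumptions: both models are evaluated at their MLEs, the χ² approximation with one degree of freedom is used, and the second data set (70, 10, 20) is hypothetical. For one degree of freedom, the χ² tail probability equals erfc(√(s/2)) for a statistic s.

```python
from math import erfc, log, sqrt


def multinomial_loglik(counts, probs):
    """Multinomial log-likelihood up to the constant multinomial coefficient."""
    return sum(n * log(q) for n, q in zip(counts, probs) if n > 0)


def hw_lrt(counts):
    """Likelihood ratio test of Hardy-Weinberg (M0) against the full multinomial (M1)."""
    n_AA, n_Aa, n_aa = counts
    N = sum(counts)
    p = (2 * n_AA + n_Aa) / (2 * N)                                  # MLE under M0
    loglik0 = multinomial_loglik(counts, (p * p, 2 * p * (1 - p), (1 - p) ** 2))
    loglik1 = multinomial_loglik(counts, [n / N for n in counts])    # MLE under M1
    stat = max(0.0, -2 * (loglik0 - loglik1))                        # -2 log Lambda
    return stat, erfc(sqrt(stat / 2))                                # statistic, p-value


print(hw_lrt((81, 18, 1)))    # data exactly at Hardy-Weinberg proportions
print(hw_lrt((70, 10, 20)))   # hypothetical data with a deficit of heterozygotes
```

For the data of Example 11 the observed genotype proportions coincide with the Hardy-Weinberg expectations at p = 0.9, so the statistic is zero and the null model is not rejected.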


Exercise 15 (Poisson Distribution): We wish to model the number of 
bacterial colonies in a Petri dish and assume that the count data of 
this experiment follows a Poisson distribution Pois(λ) (Example 3).
Derive the log-likelihood function of this model and calculate the
MLE of the model parameter λ. Suppose now that the number of
bacterial colonies on a Petri dish follows the Poisson distribution
with mean λ = 5. What is the probability of finding exactly three
colonies? 


3 Hidden Data and the EM Algorithm 


We often cannot observe all relevant random variables due to, for 
example, experimental limitations or study designs. In this case, a 
statistical model $P(X, Z \mid \theta \in \Theta)$ consists of the observed random
variable X and the hidden (or latent) random variable Z, both of
which can be multivariate. In this section, we write
$X = (X^{(1)}, \ldots, X^{(N)})$ for the random variables describing the N observations and
refer to X also as the observed data. The hidden data for this model
is $Z = (Z^{(1)}, \ldots, Z^{(N)})$ and the complete data is (X, Z). For
convenience, we assume the parameter space Θ to be continuous
and the state spaces $\mathcal{X}$ of X and $\mathcal{Z}$ of Z to be discrete.

In the Bayesian framework, one does not distinguish between
unknown parameters and hidden data, and it is natural to assess the
joint posterior $P(\theta, Z \mid X) \propto P(X \mid \theta, Z)\, P(\theta, Z)$, which is
$P(X, Z \mid \theta)\, P(\theta)$ if priors are independent, i.e., if $P(\theta, Z) = P(\theta) P(Z)$.
Alternatively, if the distribution of the hidden data Z is not of interest,
it can be marginalized out. Then the posterior (Eq. 40) becomes

$$P(\theta \mid X) = \frac{\sum_{Z} P(X, Z \mid \theta)\, P(\theta)}{\int_{\theta \in \Theta} \sum_{Z} P(X, Z \mid \theta)\, P(\theta)\, d\theta}. \tag{46}$$

In the likelihood framework, it can be more efficient to estimate
the hidden data, rather than marginalizing over it. The hidden
(or complete-data) log-likelihood is

$$\ell_{\mathrm{hid}}(\theta) = \log P(X, Z \mid \theta) = \sum_{i=1}^{N} \log P(X^{(i)}, Z^{(i)} \mid \theta). \tag{47}$$

For ML parameter estimation, we need to consider the observed
log-likelihood:

$$\ell_{\mathrm{obs}}(\theta) = \log P(X \mid \theta) = \log \sum_{Z} P(X, Z \mid \theta) = \log \sum_{Z^{(1)} \in \mathcal{Z}} \cdots \sum_{Z^{(N)} \in \mathcal{Z}} P(X, Z \mid \theta). \tag{48}$$


This likelihood function is usually very difficult to maximize and one 
has to resort to numerical optimization techniques. Generic local 
methods, such as gradient descent or Newton’s method, can be 
used, but there is also a more specific local optimization procedure, 
which avoids computing any derivatives of the likelihood function, 
called the expectation maximization (EM) algorithm [13]. 

In order to maximize the likelihood function (Eq. 48), we 
consider any distribution q(Z) of the hidden data Z and write

$$\ell_{\mathrm{obs}}(\theta) = \log \sum_{Z} q(Z)\, \frac{P(X, Z \mid \theta)}{q(Z)} = \log E\left[\frac{P(X, Z \mid \theta)}{q(Z)}\right], \tag{49}$$

where the expected value is with respect to q(Z). Jensen’s inequality
applied to the concave log function asserts that log E[Y] ≥ E[log Y].
Hence, the observed log-likelihood is bounded from below by
$E[\log(P(X, Z \mid \theta)/q(Z))]$, or

$$\ell_{\mathrm{obs}}(\theta) \ge E[\ell_{\mathrm{hid}}(\theta)] + H(q), \tag{50}$$

where $H(q) = -E[\log q(Z)]$ is the entropy. The idea of the EM
algorithm is to maximize this lower bound instead of $\ell_{\mathrm{obs}}(\theta)$ itself.
Intuitively, this task is easier because the big sum over the hidden 
data in Eq. 48 disappears on the right-hand side of Eq. 50 upon 
taking expectations. 

The EM algorithm is an iterative procedure alternating 
between an E step and an M step. In the E step, the lower bound 
(Eq. 50) is maximized with respect to the distribution q by setting
$q(Z) = P(Z \mid X, \theta^{(t)})$, where $\theta^{(t)}$ is the current estimate of θ, and
computing the expected value of the hidden log-likelihood:

$$Q(\theta \mid \theta^{(t)}) = E_{Z \mid X, \theta^{(t)}}\left[\ell_{\mathrm{hid}}(\theta)\right]. \tag{51}$$

In the M step, Q is maximized with respect to θ to obtain an
improved estimate:

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}). \tag{52}$$

The sequence $\theta^{(1)}, \theta^{(2)}, \theta^{(3)}, \ldots$ converges to a local maximum of
the likelihood surface (Eq. 48). The global maximum and, hence,
the MLE is generally not guaranteed to be found with this local
optimization method. In practice, the EM algorithm is often run
repeatedly with many different starting solutions $\theta^{(1)}$, or with a few
very reasonable starting solutions obtained from other heuristics or
educated guesses.
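The alternation of E and M steps can be illustrated with a deliberately simple model that is not taken from the text: a mixture of two coins with unknown head probabilities, where the hidden variable indicates which coin produced each trial and the mixing weights are fixed at 1/2. The data, the starting values, and all names in this Python sketch are assumptions made for the example.

```python
from math import comb, log


def em_two_coins(heads, m, p_init, n_iter=50):
    """EM for a 50/50 mixture of two coins; heads[i] = number of heads in m tosses."""
    p = list(p_init)
    history = []
    for _ in range(n_iter):
        # E step: posterior probability (responsibility) of each coin for each trial.
        resp = []
        loglik = 0.0
        for h in heads:
            joint = [0.5 * comb(m, h) * q ** h * (1 - q) ** (m - h) for q in p]
            total = sum(joint)                    # P(x_i | p), summing out the coin
            loglik += log(total)
            resp.append([j / total for j in joint])
        history.append(loglik)                    # observed log-likelihood at the current p
        # M step: responsibility-weighted fraction of heads for each coin.
        for c in range(2):
            weighted_heads = sum(r[c] * h for r, h in zip(resp, heads))
            weighted_tosses = sum(r[c] * m for r in resp)
            p[c] = weighted_heads / weighted_tosses
    return p, history


heads = [5, 9, 8, 4, 7]          # hypothetical data: heads in 10 tosses per trial
p_hat, history = em_two_coins(heads, m=10, p_init=(0.6, 0.5))
print(p_hat)
print(history[0], history[-1])   # the observed log-likelihood never decreases
```

Restarting from different values of `p_init` is the practical safeguard against poor local maxima mentioned above.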


Example 16 (Naive Bayes): Let us assume that we observe realizations
of a discrete random variable $(X_1, \ldots, X_L)$ and we want to
cluster observations into K distinct groups. For this purpose, we
introduce a hidden random variable Z with state space
$\mathcal{Z} = [K] = \{1, \ldots, K\}$ indicating class membership. The joint
probability of $(X_1, \ldots, X_L)$ and Z is

$$P(X_1, \ldots, X_L, Z) = P(Z)\, P(X_1, \ldots, X_L \mid Z) = P(Z) \prod_{n=1}^{L} P(X_n \mid Z). \tag{53}$$

The marginalization of this model with respect to the hidden data
Z is the unsupervised naive Bayes model. The observed variables $X_n$
are often called features and Z the latent class variable (Fig. 7).

Fig. 7 Graphical representation of the naive Bayes model. Observed features $X_n$
are conditionally independent given the latent class variable Z

The model parameters are the class prior P(Z), which we
assume to be constant and will ignore, and the conditional probabilities
$\theta_{n,kx} = P(X_n = x \mid Z = k)$. The complete-data likelihood
of observed data $X = (X^{(1)}, \ldots, X^{(N)})$ and hidden data
$Z = (Z^{(1)}, \ldots, Z^{(N)})$ is

$$P(X, Z \mid \theta) = \prod_{i=1}^{N} P(X^{(i)}, Z^{(i)} \mid \theta) = \prod_{i=1}^{N} P(Z^{(i)}) \prod_{n=1}^{L} P(X_n^{(i)} \mid Z^{(i)}) \tag{54}$$

$$\propto \prod_{i=1}^{N} \prod_{n=1}^{L} \prod_{k \in [K]} \prod_{x \in \mathcal{X}} \theta_{n,kx}^{I_{n,kx}(Z^{(i)})}, \tag{55}$$

where $I_{n,kx}(Z^{(i)})$ is equal to one if and only if $Z^{(i)} = k$ and $X_n^{(i)} = x$,
and zero otherwise.

To apply the EM algorithm for estimating θ without observing
Z, we consider the hidden log-likelihood, up to an additive constant
that does not depend on θ:

$$\ell_{\mathrm{hid}}(\theta) = \log P(X, Z \mid \theta) = \sum_{i=1}^{N} \sum_{n=1}^{L} \sum_{k \in [K]} \sum_{x \in \mathcal{X}} I_{n,kx}(Z^{(i)}) \log \theta_{n,kx} + \text{const}. \tag{56}$$

In the E step, we compute the expected values of the indicators $I_{n,kx}(Z^{(i)})$:

$$\gamma_{n,kx}^{(i)} = E_{Z \mid X_n = x, \theta'}\left[I_{n,kx}(Z^{(i)})\right] = \frac{P(X_n^{(i)} = x \mid Z^{(i)} = k)}{\sum_{k' \in [K]} P(X_n^{(i)} = x \mid Z^{(i)} = k')} = \frac{\theta'_{n,kx}}{\sum_{k' \in [K]} \theta'_{n,k'x}}, \tag{57}$$

where θ' is the current estimate of θ. The expected value $\gamma_{n,kx}^{(i)}$ is
sometimes referred to as the responsibility of class k for observation
$X_n^{(i)} = x$. The expected hidden log-likelihood can be written in
terms of the expected counts $N_{n,kx} = \sum_{i=1}^{N} \gamma_{n,kx}^{(i)}$ as:

$$E_{Z \mid X, \theta'}\left[\ell_{\mathrm{hid}}(\theta)\right] = \sum_{n=1}^{L} \sum_{k \in [K]} \sum_{x \in \mathcal{X}} N_{n,kx} \log \theta_{n,kx}. \tag{58}$$

In the M step, maximization of this sum yields
$\hat{\theta}_{n,kx} = N_{n,kx} / \sum_{x' \in \mathcal{X}} N_{n,kx'}$.

4 Markov Chains


A stochastic process $\{X_t, t \in T\}$ is a collection of random variables
with common state space $\mathcal{X}$. The index set T is usually interpreted as
time and $X_t$ is the state of the process at time t. A discrete-time
stochastic process $X = (X_1, X_2, X_3, \ldots)$ is called a Markov chain
[14], if $X_{n+1} \perp (X_1, \ldots, X_{n-1}) \mid X_n$ for all n ≥ 2 or, equivalently, if each
state depends only on its immediate predecessor:

$$P(X_n \mid X_{n-1}, \ldots, X_1) = P(X_n \mid X_{n-1}), \quad \text{for all } n \ge 2. \tag{59}$$


We consider here Markov chains with finite state space
$\mathcal{X} = [K] = \{1, \ldots, K\}$ that are homogeneous, i.e., with transition
probabilities independent of time:

$$T_{kl} = P(X_{n+1} = l \mid X_n = k), \quad \text{for all } k, l \in [K],\ n \geq 1. \tag{60}$$


The finite-state homogeneous Markov chain is a statistical model
denoted MC(Π, T) and defined by the initial state distribution
$\Pi \in \Delta_{K-1}$, where $\Pi_k = P(X_1 = k)$, and the stochastic K × K transition
matrix $T = (T_{kl})$.

We can generalize the one-step transition probabilities $T_{kl}$ to:

$$T^{(n)}_{kl} = P(X_{n+j} = l \mid X_j = k), \tag{61}$$


the probability of jumping from state k to state l in n time steps. Any
(n + m)-step transition can be regarded as an n-step transition
followed by an m-step transition. Because the intermediate state
i is unknown, summing over all possible values yields the
decomposition:

$$T^{(n+m)}_{kl} = \sum_{i=1}^{K} T^{(n)}_{ki}\, T^{(m)}_{il}, \quad \text{for all } n, m \geq 1,\ k, l \in [K], \tag{62}$$

known as the Chapman-Kolmogorov equations. In matrix notation,
they can be written as $T^{(n+m)} = T^{(n)} T^{(m)}$. It follows that the
n-step transition matrix is the n-th matrix power of the one-step
transition matrix, $T^{(n)} = T^n$.
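The identity $T^{(n+m)} = T^{(n)} T^{(m)}$ can be checked numerically. The following is a minimal sketch assuming NumPy; the 3-state matrix and the helper name `n_step` are illustrative choices, not taken from the text.

```python
import numpy as np

# Hypothetical 3-state transition matrix (each row sums to one).
T = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

def n_step(T, n):
    """n-step transition matrix T^(n), computed as the n-th matrix power of T."""
    return np.linalg.matrix_power(T, n)

n, m = 3, 4
lhs = n_step(T, n + m)                 # T^(n+m)
rhs = n_step(T, n) @ n_step(T, m)      # T^(n) T^(m)
ck_holds = np.allclose(lhs, rhs)       # Chapman-Kolmogorov in matrix form
```

Every power of a stochastic matrix is again stochastic, so the rows of `lhs` also sum to one.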

A state l of a Markov chain is accessible from state k if $T^{(n)}_{kl} > 0$
for some $n \geq 0$ (with $T^{(0)}$ the identity matrix).
We say that k and l communicate with each other and write k ~ l if
they are accessible from one another. State communication is reflexive
(k ~ k), symmetric (k ~ l ⇒ l ~ k), and, by the Chapman-Kolmogorov
equations, transitive (j ~ k ~ l ⇒ j ~ l). Hence, it
defines an equivalence relation on the state space. The Markov
chain is irreducible if it has a single communication class, i.e., if
any state is accessible from any other state.

A state is recurrent if the Markov chain will reenter it with
probability one. Otherwise, the state is transient. In finite-state
Markov chains, recurrent states are also positive recurrent, i.e.,
the expected time to return to the state is finite. A state is aperiodic
if its possible return times have greatest common divisor one, so that
the process can return to it at every sufficiently large time n. Recurrence,
positive recurrence, and aperiodicity are class properties: if they
hold for a state k, then they also hold for all states communicating
with k.




A Markov chain is ergodic if it is irreducible, aperiodic, and
positive recurrent. An ergodic Markov chain has a unique stationary
distribution π given by:

$$\pi_l = \lim_{n \to \infty} T^{(n)}_{kl} = \sum_{j=1}^{K} \pi_j T_{jl}, \qquad \sum_{l=1}^{K} \pi_l = 1, \tag{63}$$

independent of the initial distribution Π. In matrix notation, π is
the solution of $\pi^{t} = \pi^{t} T$.


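Example 17 (Two-State Markov Chain): Consider the Markov chain
with state space {1, 2} and transition probabilities $T_{12} = \alpha > 0$ and
$T_{21} = \beta > 0$. The chain is irreducible and, unless α = β = 1, aperiodic,
hence ergodic. Its stationary distribution π is given by:

$$(\pi_1, \pi_2) \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix} = (\pi_1, \pi_2) \tag{64}$$

or, equivalently, $\alpha \pi_1 = \beta \pi_2$. With $\pi_1 + \pi_2 = 1$, we obtain
$\pi^{t} = (\alpha + \beta)^{-1} (\beta, \alpha)$.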
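As a numerical cross-check of the two-state result, the sketch below compares the closed form with a left eigenvector of T and with a high matrix power. It assumes NumPy; the values of α and β and the function names are illustrative only.

```python
import numpy as np

def stationary_two_state(alpha, beta):
    """Closed form for the two-state chain: pi = (beta, alpha) / (alpha + beta)."""
    return np.array([beta, alpha]) / (alpha + beta)

def stationary_by_eigenvector(T):
    """Left eigenvector of T for eigenvalue 1, normalized to sum to one."""
    eigvals, eigvecs = np.linalg.eig(T.T)
    v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    return v / v.sum()

alpha, beta = 0.3, 0.1                       # illustrative values
T = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])
pi_closed = stationary_two_state(alpha, beta)
pi_eig = stationary_by_eigenvector(T)
pi_limit = np.linalg.matrix_power(T, 200)[0]  # a row of T^n for large n
```

All three vectors agree up to floating-point error, which illustrates that the limit in the definition of π does not depend on the starting state.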


In Example 17, if α = 0, then state 1 is called an absorbing state
because once entered it is never left. In evolutionary biology and
population genetics, Markov chains are often used to model evolving
populations, and the fixation probability of an allele can be
computed as the absorption probability in such models.


Example 18 (Wright-Fisher Process): We consider two alleles, A and
a, in a diploid population of size N. The total number of A alleles in
generation n is described by a Markov chain $X_n$ with state space
{0, 1, 2, ..., 2N}. We assume that individuals mate randomly and
that maternal and paternal alleles are chosen randomly such that
$(X_{n+1} \mid X_n = k) \sim \mathrm{Binom}(2N, k/(2N))$, where k is the number of A
alleles in generation n. The Markov chain has transition
probabilities:

$$T_{kl} = \binom{2N}{l} \left(\frac{k}{2N}\right)^{l} \left(1 - \frac{k}{2N}\right)^{2N-l}. \tag{65}$$

If the initial number of A alleles is $X_1 = k$, then $E(X_1) = k$. After
binomial sampling, $E(X_2) = 2N (k/(2N)) = k$ and hence $E(X_n) = k$
for all $n \geq 1$. The Markov chain has the two absorbing states
0 and 2N, which correspond, respectively, to extinction and fixation
of the A allele. To compute the fixation probability $h_k$ of A
given k initial copies of it:

$$h_k = \lim_{n \to \infty} P(X_n = 2N \mid X_1 = k), \tag{66}$$

we consider the expected value, which is equal to k, in the limit as
$n \to \infty$ to obtain

$$k = \lim_{n \to \infty} E(X_n) = 0 \cdot (1 - h_k) + 2N \cdot h_k. \tag{67}$$

Thus, the fixation probability is just $h_k = k/(2N)$, the initial relative
frequency of the allele. The Wright-Fisher process [15, 16] is a
basic stochastic model for random genetic drift, i.e., for the variation
in allele frequencies only due to random sampling.
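The fixation probability $h_k = k/(2N)$ can be verified by building the binomial transition matrix and raising it to a high power. This is a sketch assuming NumPy; the population size N = 5 and the cutoff of 2000 generations are arbitrary illustrative choices.

```python
from math import comb
import numpy as np

def wright_fisher_matrix(N):
    """Binomial transition matrix on states 0..2N (number of A alleles)."""
    M = 2 * N
    T = np.zeros((M + 1, M + 1))
    for k in range(M + 1):
        p = k / M                       # allele frequency in the current generation
        for l in range(M + 1):
            T[k, l] = comb(M, l) * p**l * (1 - p)**(M - l)
    return T

N = 5                                   # hypothetical population size
T = wright_fisher_matrix(N)
T_limit = np.linalg.matrix_power(T, 2000)   # approximates the limit n -> infinity
h = T_limit[:, 2 * N]                   # probability of ending in state 2N from state k
expected = np.arange(2 * N + 1) / (2 * N)   # k / (2N)
```

After many generations all probability mass sits on the absorbing states 0 and 2N, so the last column of `T_limit` gives the fixation probabilities directly.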


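If we observe data $X = (X^{(1)}, \ldots, X^{(N)})$ from a finite Markov
chain MC(Π, T) of length L, then the likelihood is

$$L(\Pi, T) = \prod_{i=1}^{N} P(X^{(i)}) = \prod_{i=1}^{N} P(X_1^{(i)}) \prod_{n=1}^{L-1} P(X_{n+1}^{(i)} \mid X_n^{(i)}) = \prod_{i=1}^{N} \Pi_{X_1^{(i)}} \prod_{n=1}^{L-1} T_{X_n^{(i)} X_{n+1}^{(i)}}, \tag{68}$$

which can be rewritten as:

$$L(\Pi, T) = \prod_{i=1}^{N} \prod_{k \in [K]} \Pi_k^{N_k(X^{(i)})} \prod_{k \in [K]} \prod_{l \in [K]} T_{kl}^{N_{kl}(X^{(i)})} = \prod_{k \in [K]} \Pi_k^{N_k} \prod_{k \in [K]} \prod_{l \in [K]} T_{kl}^{N_{kl}}, \tag{69}$$

with $N_{kl}(X^{(i)})$ the number of observed transitions from state k into
state l in observation $X^{(i)}$, and $N_{kl} = \sum_{i=1}^{N} N_{kl}(X^{(i)})$ the total
number of k-to-l transitions in the data, and similarly $N_k(X^{(i)})$
and $N_k$ the number of times the i-th chain, respectively all chains,
started in state k.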
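The two forms of the likelihood, the product along each chain and the product over counts, can be compared on toy data. This sketch assumes NumPy and codes states as 0, ..., K-1 rather than 1, ..., K; the parameter values and sequences are made up for illustration.

```python
import numpy as np

def log_likelihood_direct(chains, Pi, T):
    """Log of the chain-by-chain product: walk along each observed chain."""
    total = 0.0
    for x in chains:
        total += np.log(Pi[x[0]])
        for a, b in zip(x[:-1], x[1:]):
            total += np.log(T[a, b])
    return total

def count_statistics(chains, K):
    """Start counts N_k and transition counts N_kl summed over all chains."""
    N_start = np.zeros(K)
    N_trans = np.zeros((K, K))
    for x in chains:
        N_start[x[0]] += 1
        for a, b in zip(x[:-1], x[1:]):
            N_trans[a, b] += 1
    return N_start, N_trans

def log_likelihood_counts(N_start, N_trans, Pi, T):
    """Log of the count-based product form."""
    return float(N_start @ np.log(Pi) + np.sum(N_trans * np.log(T)))

# Illustrative parameters and data (not from the text).
Pi = np.array([0.5, 0.5])
T = np.array([[0.8, 0.2],
              [0.3, 0.7]])
chains = [[1, 1, 0, 0, 1], [0, 1, 0, 1, 0], [0, 0, 1, 1, 0]]
N_start, N_trans = count_statistics(chains, K=2)
ll_direct = log_likelihood_direct(chains, Pi, T)
ll_counts = log_likelihood_counts(N_start, N_trans, Pi, T)
```

The count form shows that the data enter the likelihood only through $N_k$ and $N_{kl}$.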


Exercise 19 (Markov Chains): Let us consider a simple infectious
disease model, where each individual is either healthy (H) or diseased
(D). We assume the following two-state Markov chain to
describe infection-related disease and recovery via clearance of the
pathogen:


The probability of a healthy individual becoming sick due to
infection is α = 0.6, and the probability of a diseased individual
clearing the infection and recovering is β = 0.9. The initial probabilities
for health and disease are P(H) = 0.7 and P(D) = 0.3. Write down
the transition matrix T of this Markov chain. What is the probability
of observing the disease trajectories DDHHD and HDHDH? Calculate
the stationary distribution of the Markov chain.
5 Continuous-Time Markov Chains


A continuous-time stochastic process $\{X(t), t \geq 0\}$ with finite state
space [K] is a continuous-time Markov chain if

$$P[X(t+s) = l \mid X(s) = k, X(u) = x(u), 0 \leq u < s] = P[X(t+s) = l \mid X(s) = k] \tag{70}$$

for all $s, t \geq 0$, $k, l, x(u) \in [K]$, $0 \leq u < s$. The chain is homogeneous
if Eq. 70 is independent of s. The transition probabilities are
then denoted:

$$T_{kl}(t) = P[X(t+s) = l \mid X(s) = k]. \tag{71}$$


It can be shown that the transition matrix T(t) is the matrix exponential
of a constant rate matrix R times t:

$$T(t) = \exp(Rt) = \sum_{j=0}^{\infty} \frac{1}{j!} (Rt)^j. \tag{72}$$


Example 20 (Jukes-Cantor Model): Consider a fixed position in a
DNA sequence, and let $T_{kl}(t)$ be the probability that, due to mutation,
nucleotide k changes to nucleotide l after time t at this
position (Fig. 8). The Jukes-Cantor model [17] is the simplest
DNA substitution model. It assumes that the transition rates from
any nucleotide to any other are equal:

$$R = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}. \tag{73}$$

Fig. 8 Nucleotide substitution model. The state space and transitions of a
general nucleotide substitution model are shown. For the Jukes-Cantor model
(Example 20), all transitions from any nucleotide to any other nucleotide have the
same probability $\frac{1}{4}(1 - e^{-4\alpha t})$
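The resulting transition matrix T(t) = exp(Rt) is

$$T(t) = \frac{1}{4} \begin{pmatrix} 1 + 3e^{-4\alpha t} & 1 - e^{-4\alpha t} & 1 - e^{-4\alpha t} & 1 - e^{-4\alpha t} \\ 1 - e^{-4\alpha t} & 1 + 3e^{-4\alpha t} & 1 - e^{-4\alpha t} & 1 - e^{-4\alpha t} \\ 1 - e^{-4\alpha t} & 1 - e^{-4\alpha t} & 1 + 3e^{-4\alpha t} & 1 - e^{-4\alpha t} \\ 1 - e^{-4\alpha t} & 1 - e^{-4\alpha t} & 1 - e^{-4\alpha t} & 1 + 3e^{-4\alpha t} \end{pmatrix} \tag{74}$$

and the stationary distribution as $t \to \infty$ is uniform, $\pi = (1/4, 1/4, 1/4, 1/4)^{t}$.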
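The closed form of the Jukes-Cantor transition matrix can be checked against a truncated power series for the matrix exponential. This is a minimal sketch assuming NumPy; the values of α and t, the truncation at 60 terms, and all function names are illustrative assumptions.

```python
import numpy as np

def jc_rate_matrix(alpha):
    """Jukes-Cantor rate matrix: -3*alpha on the diagonal, alpha elsewhere."""
    return alpha * (np.ones((4, 4)) - 4 * np.eye(4))

def expm_series(A, terms=60):
    """Matrix exponential by truncating the power series sum_j A^j / j!."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for j in range(1, terms):
        term = term @ A / j             # term now equals A^j / j!
        result = result + term
    return result

def jc_closed_form(alpha, t):
    """Closed-form Jukes-Cantor transition matrix T(t)."""
    e = np.exp(-4 * alpha * t)
    return 0.25 * ((1 - e) * np.ones((4, 4)) + 4 * e * np.eye(4))

alpha, t = 0.1, 2.0                     # illustrative values
T_series = expm_series(jc_rate_matrix(alpha) * t)
T_closed = jc_closed_form(alpha, t)
T_long = jc_closed_form(alpha, 1e3)     # very long time: close to uniform
```

The truncated series is adequate here because the entries of Rt are small; for large rates or times a dedicated matrix-exponential routine would be the safer choice.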


Example 21 (The Poisson Process): A continuous-time Markov chain
X(t) is a counting process if X(t) represents the total number of
events that occur by time t. It is a Poisson process if, in addition,
X(0) = 0, the increments are independent, and in any interval of
length t the number of events is Poisson distributed with rate λt:

$$P[X(t+s) - X(s) = k] = e^{-\lambda t} \frac{(\lambda t)^k}{k!}. \tag{75}$$

The Poisson process is used, for example, to count mutations in a
gene.


Example 22 (Exponential Distribution): The exponential distribution
Exp(λ) with parameter λ > 0 is a common distribution for
waiting times. It is defined by the density function:

$$f(x) = \lambda e^{-\lambda x}, \quad \text{for } x \geq 0. \tag{76}$$

If X ~ Exp(λ), then X has expectation $E(X) = \lambda^{-1}$ and variance
$\mathrm{Var}(X) = \lambda^{-2}$. The exponential distribution is memoryless, which
means that $P(X > s + t \mid X > t) = P(X > s)$, for all $s, t \geq 0$. An
important consequence of the memoryless property is that the
waiting times between successive events are i.i.d. For example, the
waiting times $T_n$ ($n \geq 1$) between the events of a Poisson process,
the sequence of interarrival times, are exponentially distributed,
$T_n \sim \mathrm{Exp}(\lambda)$, for all $n \geq 1$.
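The link between exponential waiting times and Poisson counts can be illustrated by simulation. The sketch below assumes NumPy's `default_rng`; the rate, interval length, number of runs, and seed are arbitrary illustrative choices.

```python
import numpy as np

def simulate_poisson_counts(lam, t, n_runs, rng):
    """Count events in [0, t] by accumulating i.i.d. Exp(lam) waiting times."""
    counts = np.empty(n_runs, dtype=int)
    for r in range(n_runs):
        time, events = 0.0, 0
        while True:
            time += rng.exponential(scale=1.0 / lam)   # waiting time with mean 1/lam
            if time > t:
                break
            events += 1
        counts[r] = events
    return counts

rng = np.random.default_rng(0)
lam, t = 2.0, 3.0                       # illustrative rate and interval length
counts = simulate_poisson_counts(lam, t, n_runs=20000, rng=rng)
mean_count, var_count = counts.mean(), counts.var()
```

For a Poisson distribution with parameter λt, both the mean and the variance equal λt (here 6), which the simulated counts should reproduce up to sampling error.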


Exercise 23 (Kimura Model): The Kimura two-parameter model is a
DNA substitution model that distinguishes transitions,
i.e., purine-to-purine and pyrimidine-to-pyrimidine substitutions,
from transversions, i.e., purine-to-pyrimidine and pyrimidine-to-purine
substitutions [18]. It is defined by the rate matrix:

$$R = \begin{pmatrix} -2\beta-\alpha & \beta & \alpha & \beta \\ \beta & -2\beta-\alpha & \beta & \alpha \\ \alpha & \beta & -2\beta-\alpha & \beta \\ \beta & \alpha & \beta & -2\beta-\alpha \end{pmatrix},$$

where $\alpha, \beta \in \mathbb{R}_+$ are the two substitution rates. Assuming that the
Markov chain is ergodic, derive its stationary distribution.
6 Hidden Markov Models


A hidden Markov model (HMM) is a statistical model for hidden
random variables $Z = (Z_1, \ldots, Z_L)$, which form a homogeneous
Markov chain, and observed random variables $X = (X_1, \ldots, X_L)$.
Each observed symbol $X_n$ depends on the hidden state $Z_n$. The
HMM is illustrated in Fig. 9. It encodes the following conditional
independence statements:

$$Z_{n+1} \perp Z_{n-1} \mid Z_n, \quad 2 \leq n \leq L-1 \quad \text{(Markov property)} \tag{77}$$

$$X_n \perp X_m \mid Z_n, \quad 1 \leq m, n \leq L,\ m \neq n \tag{78}$$


The parameters of the HMM consist of the initial state
probabilities $\Pi = P(Z_1)$, the transition probabilities
$T_{kl} = P(Z_n = l \mid Z_{n-1} = k)$ of the Markov chain, and the emission probabilities
$E_{kx} = P(X_n = x \mid Z_n = k)$ of symbols $x \in \mathcal{X}$. The HMM is denoted
HMM(Π, T, E). For simplicity, we restrict ourselves here to finite
state spaces $\mathcal{Z} = [K]$ of Z and $\mathcal{X}$ of X. The joint probability of (Z, X)
factorizes as:

$$P(X, Z) = P(Z_1) P(X_1 \mid Z_1) \prod_{n=1}^{L-1} P(Z_{n+1} \mid Z_n) P(X_{n+1} \mid Z_{n+1}) = \Pi_{Z_1} E_{Z_1 X_1} \prod_{n=1}^{L-1} T_{Z_n Z_{n+1}} E_{Z_{n+1} X_{n+1}}. \tag{79}$$


The HMM is typically used to model sequence data $x = (x_1, x_2, \ldots, x_L)$
generated by different mechanisms $z_n$, which cannot be
observed. Each observation x can be a time series or any other
object with a linear dependency structure [19]. In computational
biology, the HMM is frequently applied to DNA and protein
sequence data, where it accounts for first-order spatial dependencies
of nucleotides or amino acids [20].


Fig. 9 Hidden Markov model. Shaded nodes represent observed random variables (or symbols) $X_n$ and clear
nodes represent hidden states (or the annotation). Directed edges indicate statistical dependencies which are
given, respectively, by transition and emission probabilities among hidden states and between hidden states
and observed symbols
Example 24 (CpG Islands): CpG islands are CG-enriched regions in
a DNA sequence. They are typically a few hundred to a few thousand
base pairs long. We want to use a simple HMM to detect CpG
islands in genomic DNA. The hidden states $Z_n \in \mathcal{Z} = \{-, +\}$ indicate
whether sequence position n belongs to a CpG island (+) or
not (-). The observed sequence is given by the nucleotide at each
position, $X_n \in \mathcal{X} = \{A, C, G, T\}$.

Suppose we observe the sequence x = (C, A, C, G). Then, we can
calculate the joint probability of x and any state path z by Eq. 79.
For example, if z = (+, -, -, +), then
$P(X = x, Z = z) = \Pi_{+} E_{+,C}\, T_{+-} E_{-,A}\, T_{--} E_{-,C}\, T_{-+} E_{+,G}$.
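The product in this example can be evaluated directly once parameters are fixed. The sketch below assumes NumPy and uses made-up values for Π, T, and E; the text does not specify any, so these numbers are purely hypothetical.

```python
import numpy as np

STATES = {"-": 0, "+": 1}
SYMBOLS = {"A": 0, "C": 1, "G": 2, "T": 3}

# Hypothetical parameters, chosen only for illustration.
Pi = np.array([0.9, 0.1])                      # initial probabilities of (-, +)
T = np.array([[0.95, 0.05],
              [0.10, 0.90]])                   # T[k, l] = P(Z_n = l | Z_{n-1} = k)
E = np.array([[0.30, 0.20, 0.20, 0.30],        # emissions outside islands
              [0.15, 0.35, 0.35, 0.15]])       # emissions inside islands

def joint_probability(x, z, Pi, T, E):
    """P(X = x, Z = z) from the HMM factorization along the chain."""
    xs = [SYMBOLS[c] for c in x]
    zs = [STATES[s] for s in z]
    p = Pi[zs[0]] * E[zs[0], xs[0]]
    for n in range(1, len(xs)):
        p *= T[zs[n - 1], zs[n]] * E[zs[n], xs[n]]
    return p

p = joint_probability("CACG", "+--+", Pi, T, E)
```

With these values, `p` is the product of the eight factors listed above, in the same order.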


Typically, one is interested in the hidden state path $z = (z_1, z_2, \ldots, z_L)$
that gave rise to the observation x. For biological sequences,
z is often called the annotation of x. In Example 24, the genomic
sequence is annotated with CpG islands. For generic parameters,
any state path can give rise to a given observed sequence, but with
different probabilities. The decoding problem is to find the annotation
z* that maximizes the joint probability:

$$z^* = \operatorname*{argmax}_{z \in \mathcal{Z}^L} P(X = x, Z = z). \tag{80}$$

There are $K^L$ possible state paths such that already for sequences of
moderate length, the optimization problem (Eq. 80) cannot be
solved by enumerating all paths.
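However, there is an efficient algorithm solving (Eq. 80) based
on the following factorization along the Markov chain:

$$\max_{Z} P(X, Z) = \max_{Z_1, \ldots, Z_L} P(Z_1) P(X_1 \mid Z_1) \prod_{n=1}^{L-1} P(Z_{n+1} \mid Z_n) P(X_{n+1} \mid Z_{n+1})$$

$$= \max_{Z_L} \Big[ P(X_L \mid Z_L) \max_{Z_{L-1}} \Big[ P(Z_L \mid Z_{L-1}) P(X_{L-1} \mid Z_{L-1}) \cdots \max_{Z_2} \Big[ P(Z_3 \mid Z_2) P(X_2 \mid Z_2) \max_{Z_1} \big[ P(Z_2 \mid Z_1) P(X_1 \mid Z_1) P(Z_1) \big] \Big] \cdots \Big] \Big]. \tag{81}$$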


Thus, the maximum over state paths $(Z_1, \ldots, Z_L)$ can be obtained
by recursively computing maxima over each $Z_n$. Each of the L terms
in brackets defines a function on the K states of the next variable,
each value of which is obtained by maximizing over K values. Hence, the time complexity of the
algorithm is $O(LK^2)$, despite the fact that the maximum is over
$K^L$ paths. This procedure is known as dynamic programming and it
is the workhorse of biological sequence analysis. For HMMs, it is
known as the Viterbi algorithm [21].
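A compact version of this recursion is sketched below in log space, together with a brute-force maximum over all $K^L$ paths for comparison. It assumes NumPy, symbols coded as integers, and hypothetical two-state parameters; it is an illustration, not a reference implementation.

```python
import itertools
import numpy as np

def viterbi(x, Pi, T, E):
    """Most probable state path and its log joint probability."""
    L, K = len(x), len(Pi)
    logT, logE = np.log(T), np.log(E)
    v = np.log(Pi) + logE[:, x[0]]            # best log-prob of paths ending in each state
    back = np.zeros((L, K), dtype=int)        # argmax pointers for the traceback
    for n in range(1, L):
        scores = v[:, None] + logT            # scores[k, l]: come from k, move to l
        back[n] = scores.argmax(axis=0)
        v = scores.max(axis=0) + logE[:, x[n]]
    path = [int(v.argmax())]
    for n in range(L - 1, 0, -1):             # follow the pointers backwards
        path.append(int(back[n, path[-1]]))
    return path[::-1], float(v.max())

def log_joint(x, z, Pi, T, E):
    """log P(X = x, Z = z) for one explicit state path z."""
    lp = np.log(Pi[z[0]]) + np.log(E[z[0], x[0]])
    for n in range(1, len(x)):
        lp += np.log(T[z[n - 1], z[n]]) + np.log(E[z[n], x[n]])
    return float(lp)

# Hypothetical two-state model; symbols A, C, G, T are coded 0..3.
Pi = np.array([0.9, 0.1])
T = np.array([[0.95, 0.05], [0.10, 0.90]])
E = np.array([[0.30, 0.20, 0.20, 0.30], [0.15, 0.35, 0.35, 0.15]])
x = [1, 0, 1, 2, 1, 2, 2, 1, 0, 3]
best_path, best_logp = viterbi(x, Pi, T, E)
brute_logp = max(log_joint(x, z, Pi, T, E)
                 for z in itertools.product(range(2), repeat=len(x)))
```

Working with logarithms avoids numerical underflow for long sequences; the brute-force check is only feasible because the example is short.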
In order to compute the marginal likelihood P(X = x) of an
observed sequence x, we need to sum the joint probability P(Z = z,
X = x) over all hidden state paths $z \in \mathcal{Z}^L$. The length of this sum is $K^L$, but
it can be computed efficiently by the same dynamic programming
principle used for the Viterbi algorithm:

$$\sum_{Z} P(X, Z) = \sum_{Z_1, \ldots, Z_L} P(Z_1) P(X_1 \mid Z_1) \prod_{n=1}^{L-1} P(Z_{n+1} \mid Z_n) P(X_{n+1} \mid Z_{n+1})$$

$$= \sum_{Z_L} \Big[ P(X_L \mid Z_L) \sum_{Z_{L-1}} \Big[ P(Z_L \mid Z_{L-1}) P(X_{L-1} \mid Z_{L-1}) \cdots \sum_{Z_2} \Big[ P(Z_3 \mid Z_2) P(X_2 \mid Z_2) \sum_{Z_1} \big[ P(Z_2 \mid Z_1) P(X_1 \mid Z_1) P(Z_1) \big] \Big] \cdots \Big] \Big]. \tag{82}$$


Indeed, this factorization is the same as in Eq. 81 with maxima
replaced by sums. The recursive algorithm implementing (Eq. 82)
is known as the forward algorithm. In each step, it computes the
partial solution $f(n, Z_n) = P(X_1, \ldots, X_n, Z_n)$.

The factorization along the Markov chain can also be done in
the other direction starting the recursion from $Z_L$ down to $Z_1$. The
resulting backward algorithm generates the partial solutions
$b(n, Z_n) = P(X_{n+1}, \ldots, X_L \mid Z_n)$. From the forward and backward quantities,
one can also compute the position-wise posterior state
probabilities:

$$P(Z_n \mid X) = \frac{P(X, Z_n)}{P(X)} = \frac{P(X_1, \ldots, X_n, Z_n)\, P(X_{n+1}, \ldots, X_L \mid Z_n)}{P(X)} = \frac{f(n, Z_n)\, b(n, Z_n)}{P(X)}. \tag{83}$$


For example, in the CpG island HMM (Example 24), we can 
compute, for each nucleotide, the probability that it belongs to a 
CpG island given the entire observed DNA sequence. Selecting the 
state that maximizes this probability independently at each 
sequence position is known as posterior decoding. In general, the 
result will be different from Viterbi decoding. 
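The forward and backward recursions and the resulting posterior decoding can be sketched as follows. The code assumes NumPy, integer-coded symbols, and hypothetical parameters; it works with unscaled probabilities, which is adequate only for short sequences.

```python
import itertools
import numpy as np

def forward(x, Pi, T, E):
    """f[n, k] = P(X_1..X_n, Z_n = k), with positions indexed from 0."""
    f = np.zeros((len(x), len(Pi)))
    f[0] = Pi * E[:, x[0]]
    for n in range(1, len(x)):
        f[n] = (f[n - 1] @ T) * E[:, x[n]]
    return f

def backward(x, T, E):
    """b[n, k] = P(X_{n+1}..X_L | Z_n = k); the last row is all ones."""
    b = np.ones((len(x), T.shape[0]))
    for n in range(len(x) - 2, -1, -1):
        b[n] = T @ (E[:, x[n + 1]] * b[n + 1])
    return b

def posterior(x, Pi, T, E):
    """Position-wise posteriors P(Z_n = k | X = x) and the marginal P(X = x)."""
    f, b = forward(x, Pi, T, E), backward(x, T, E)
    px = f[-1].sum()
    return f * b / px, px

def joint(x, z, Pi, T, E):
    """P(X = x, Z = z) for an explicit path, used for the brute-force check."""
    p = Pi[z[0]] * E[z[0], x[0]]
    for n in range(1, len(x)):
        p *= T[z[n - 1], z[n]] * E[z[n], x[n]]
    return p

# Hypothetical two-state model; symbols A, C, G, T are coded 0..3.
Pi = np.array([0.9, 0.1])
T = np.array([[0.95, 0.05], [0.10, 0.90]])
E = np.array([[0.30, 0.20, 0.20, 0.30], [0.15, 0.35, 0.35, 0.15]])
x = [1, 0, 1, 2, 1, 2]
post, px = posterior(x, Pi, T, E)
px_brute = sum(joint(x, z, Pi, T, E)
               for z in itertools.product(range(2), repeat=len(x)))
posterior_path = post.argmax(axis=1)    # posterior decoding, position by position
```

For long sequences, practical implementations rescale `f` and `b` at each position or work in log space to avoid underflow.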


Example 25 (Pairwise Sequence Alignment): The pair HMM is a
statistical model for pairwise alignment of two observed sequences
over a fixed alphabet $\mathcal{A}$. For protein sequences, $\mathcal{A}$ is the set of
20 natural amino acids and for DNA sequences, $\mathcal{A}$ consists of the
four nucleotides, plus the gap symbol ("-"). At each position of the
alignment, a hidden variable $Z_n \in \mathcal{Z} = \{M, X, Y\}$ indicates whether
there is a (mis-)match (M), an insertion (X), or a deletion (Y) in
sequence y relative to sequence x. For example:


z= MMMMMMMMMMMMMXXMMMMMMMMMMMMYMMMMYMMMMM 
x= CTRPNNNTRKSIRPOIGPGQAF YATGD-—IGDI-—RQAHC 
y= CGRPNNHRIKGLR--—IGPGRAF FAMGAIRGGETIROAHC 


The emitted symbols are pairs $(X_n, Y_n)$ of aligned sequence characters
with state space $(\mathcal{A} \times \mathcal{A}) \setminus \{(-, -)\}$. Thus, a pairwise alignment
is a probabilistically generated sequence of pairs of symbols.

The choice of transition and emission probabilities corresponds
to fixing a scoring scheme in nonprobabilistic formulations of
sequence alignment. For example, the emission probabilities
$P[(a, b) \mid M]$ from a match state encode pairwise amino acid preferences
and can be modeled by substitution matrices, such as PAM
and BLOSUM [20].

In the pair HMM, computing an optimal alignment between
x and y means finding the most probable state path
$z^* = \operatorname{argmax}_z P(X = x, Y = y, Z = z)$, which can be solved using
the Viterbi algorithm. Using the forward algorithm, we can also
efficiently compute the marginal probability of two sequences
being related independent of their alignment, $P(X, Y) = \sum_Z P(X, Y, Z)$.
In general, this probability is more informative than the
posterior $P(Z = z^* \mid X, Y)$ of an optimal alignment z* because many
alignments tend to have the same or nearly the same probability
such that $P(Z = z^* \mid X, Y)$ can be very small. Finally, we can also
compute the probability of two characters $x_n$ and $y_m$ being aligned
by means of posterior decoding.


Example 26 (Profile HMM): Profile hidden Markov models represent
groups of related sequences, such as protein families. They are
used for searching homologous sequences and for building multiple
sequence alignments. They can be regarded as unrolled versions
of the pair HMM. A profile HMM is a statistical model for observed
sequences, which are regarded as i.i.d. realizations. It has site-specific
emission probabilities $E_n(a) = P(X_n = a)$. In its simplest
form allowing only gap-free alignments, the probability of an
observation x is just

$$P(X = x) = \prod_{n=1}^{L} E_n(x_n). \tag{84}$$

The matrix $(E_n(a))_{1 \leq n \leq L,\ a \in \mathcal{A}}$ is called a position-specific scoring
matrix (PSSM).
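For the gap-free case, scoring a sequence against a PSSM is a product of table lookups. The sketch below assumes NumPy and a made-up DNA profile of length L = 3; the table values and function name are hypothetical.

```python
import numpy as np

ALPHABET = "ACGT"

# Hypothetical site-specific emission probabilities E_n(a); one row per position.
pssm = np.array([[0.70, 0.10, 0.10, 0.10],
                 [0.10, 0.10, 0.70, 0.10],
                 [0.25, 0.25, 0.25, 0.25]])

def sequence_probability(x, pssm):
    """P(X = x) as the product of site-specific emission probabilities."""
    if len(x) != pssm.shape[0]:
        raise ValueError("gap-free model: sequence length must equal L")
    idx = [ALPHABET.index(c) for c in x]
    return float(np.prod(pssm[np.arange(len(x)), idx]))

p_consensus = sequence_probability("AGT", pssm)   # 0.7 * 0.7 * 0.25
p_other = sequence_probability("CCT", pssm)       # 0.1 * 0.1 * 0.25
```

In practice, log-odds scores relative to a background distribution are often used instead of raw probabilities, but that goes beyond the model stated here.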

Profile HMMs can also model indels. Figure 10 shows the
hidden state space of such a model. It has match states $M_n$, which
can emit symbols according to the probability tables $E_n$, insert
states $I_n$, which usually emit symbols in an unspecific manner, and
delete states $D_n$, which do not emit any symbols. The possible
transitions between those states allow for modeling alignment
gaps of any length.

Fig. 10 Profile hidden Markov model. The hidden state space and its transitions
are shown for the profile HMM of length L = 3. Match states are denoted $M_n$,
insert states $I_n$, and delete states $D_n$. B and E denote silent begin and end states,
respectively. With match and insert states, probability tables for the emissions of
symbols (amino acids or nucleotides, and gaps) are associated


A given profile HMM for a protein family can be used to detect
new sequences that belong to the same family. For a query sequence
x, we can either consider the most probable alignment of the
sequence to the HMM, $P(X = x, Z = z^*)$, or the marginal probability
independent of the alignment, $P(X = x) = \sum_{z} P(X = x, Z = z)$,
to decide about family membership.


Parameter estimation in HMMs is complicated by the presence
of hidden variables. In Subheading 3, the EM algorithm has been
introduced for finding a local maximum of the likelihood surface.
For HMMs, the EM algorithm is known as the Baum-Welch
algorithm [22]. For simplicity, let us ignore the initial state probabilities
Π and summarize the parameters of the HMM by
θ = (T, E). For ML estimation, we need to maximize the observed
log-likelihood:

$$\ell_{\mathrm{obs}}(\theta) = \log P(X \mid \theta) = \log \sum_{Z} P(X, Z \mid \theta), \tag{85}$$

where $X = (X^{(1)}, \ldots, X^{(N)})$ are the i.i.d. observations. For each observation,
we can rewrite the joint probability, up to the initial state probability, as:

$$P(X^{(i)}, Z^{(i)} \mid \theta) = \prod_{k \in [K]} \prod_{x \in \mathcal{X}} E_{kx}^{N_{kx}(Z^{(i)})} \prod_{k \in [K]} \prod_{l \in [K]} T_{kl}^{N_{kl}(Z^{(i)})}, \tag{86}$$

where $N_{kx}(Z^{(i)})$ is the number of x emissions when in state k and
$N_{kl}(Z^{(i)})$ the number of k-to-l transitions in state path $Z^{(i)}$
(cf. Eqs. 68 and 69).
In the E step, the expectation of the hidden log-likelihood
$\ell_{\mathrm{hid}}(\theta) = \log P(X, Z \mid \theta)$ is computed with
respect to $P(Z \mid X, \theta')$, where $\theta'$ is the current best estimate of $\theta$.
We use Eq. 86 and denote by $N_{kx}$ and $N_{kl}$ the expected values of
$\sum_i N_{kx}(Z^{(i)})$ and $\sum_i N_{kl}(Z^{(i)})$, respectively, to obtain

$$E[\ell_{\mathrm{hid}}(\theta)] = \sum_{Z} P(Z \mid X, \theta') \log P(X, Z \mid \theta)$$

$$= \sum_{Z^{(1)}, \ldots, Z^{(N)}} P(Z \mid X, \theta') \sum_{i=1}^{N} \Big[ \sum_{k,x} N_{kx}(Z^{(i)}) \log E_{kx} + \sum_{k,l} N_{kl}(Z^{(i)}) \log T_{kl} \Big]$$

$$= \sum_{k,x} N_{kx} \log E_{kx} + \sum_{k,l} N_{kl} \log T_{kl}. \tag{87}$$

The expected counts $N_{kx}$ and $N_{kl}$ are the sufficient statistics [11] of
the HMM, i.e., with respect to the model, they contain all information
about the parameters available from the data. The expected
counts can be computed using the forward and backward algorithms.
In the M step, this expression is maximized with respect to
θ = (T, E). We find the MLEs $\hat{T}_{kl} = N_{kl} / \sum_{m} N_{km}$ and
$\hat{E}_{kx} = N_{kx} / \sum_{y} N_{ky}$.
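One EM iteration built from these pieces might look as follows. This is a hedged sketch assuming NumPy, integer-coded symbols, unscaled forward and backward variables, a fixed Π, and made-up data and starting values; the function names are not from the text.

```python
import numpy as np

def forward_backward(x, Pi, T, E):
    """Unscaled forward (f) and backward (b) tables and the marginal P(x)."""
    L, K = len(x), len(Pi)
    f, b = np.zeros((L, K)), np.ones((L, K))
    f[0] = Pi * E[:, x[0]]
    for n in range(1, L):
        f[n] = (f[n - 1] @ T) * E[:, x[n]]
    for n in range(L - 2, -1, -1):
        b[n] = T @ (E[:, x[n + 1]] * b[n + 1])
    return f, b, f[-1].sum()

def baum_welch_step(seqs, Pi, T, E):
    """One EM iteration for theta = (T, E); Pi is held fixed."""
    N_kl, N_kx, loglik = np.zeros_like(T), np.zeros_like(E), 0.0
    for x in seqs:
        f, b, px = forward_backward(x, Pi, T, E)
        loglik += np.log(px)                     # log-likelihood at the input parameters
        gamma = f * b / px                       # P(Z_n = k | x)
        for n, sym in enumerate(x):
            N_kx[:, sym] += gamma[n]             # expected emission counts
        for n in range(len(x) - 1):
            # expected transition counts: P(Z_n = k, Z_{n+1} = l | x)
            N_kl += f[n][:, None] * T * (E[:, x[n + 1]] * b[n + 1])[None, :] / px
    T_new = N_kl / N_kl.sum(axis=1, keepdims=True)   # M step for transitions
    E_new = N_kx / N_kx.sum(axis=1, keepdims=True)   # M step for emissions
    return T_new, E_new, loglik

# Hypothetical data and starting values.
seqs = [[1, 2, 1, 2, 0, 3, 0, 3, 3, 0], [0, 3, 3, 1, 2, 2, 1, 0],
        [1, 2, 2, 1, 2, 1, 3, 0, 0, 3]]
Pi = np.array([0.5, 0.5])
T = np.array([[0.7, 0.3], [0.4, 0.6]])
E = np.array([[0.3, 0.2, 0.2, 0.3], [0.2, 0.3, 0.3, 0.2]])
logliks = []
for _ in range(10):
    T, E, ll = baum_welch_step(seqs, Pi, T, E)
    logliks.append(ll)
```

The recorded log-likelihoods should never decrease from one iteration to the next, which is the defining property of EM; convergence is only to a local maximum and depends on the starting values.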


7 Bayesian Networks 


Bayesian networks are a class of probabilistic graphical models 
which generalize Markov chains and HMMs. The basic idea is to 
use a graph for encoding conditional independences among ran- 
dom variables (Fig. 11). The graph representation provides not 
only an intuitive and simple visualization of the model structure, 
but it is also the basis for designing efficient algorithms for infer- 
ence and learning in graphical models [23-25]. 

A Bayesian network (BN) for a set of random variables
$X = (X_1, \ldots, X_L)$ consists of a directed acyclic graph (DAG) and
local probability distributions (LPDs). The DAG G = (V, E) has
vertex set V = [L] and edge set $E \subseteq V \times V$. Each vertex $n \in V$ is
identified with the random variable $X_n$. If there is an edge $X_m \to X_n$
in G, then $X_m$ is a parent of $X_n$ and $X_n$ is a child of $X_m$. For each
vertex $n \in V$, there is an LPD $P(X_n \mid X_{\mathrm{pa}(n)})$, where pa(n) is the
set of parents of $X_n$ in G. The Bayesian network model is defined as
the family of distributions for which the joint probability of
X factors into conditional probabilities as:

$$P(X_1, \ldots, X_L) = \prod_{n=1}^{L} P(X_n \mid X_{\mathrm{pa}(n)}). \tag{88}$$

Fig. 11 Example of a Bayesian network. Vertices correspond to random variables
and edges represent conditional probabilities. The graph encodes conditional
independence statements about the random variables U, V, W, X, Y, and Z.
Their joint probability factors according to the graph as P(U, V, W, X, Y) = P(U)
P(Y)P(V | U, Y)P(W | V)P(X | U)


In this case, we write $X \sim \mathrm{BN}(G, \theta)$, where $\theta = (\theta_1, \ldots, \theta_L)$
denotes the parameters of the LPDs.

For the Bayesian network shown in Fig. 11, we find P(U, V, W,
X, Y) = P(U)P(Y)P(V | U, Y)P(W | V)P(X | U). The graph
encodes several conditional independence statements about (U,
V, W, X, Y), including, for example, $W \perp \{U, X\} \mid V$.


Example 27 (Markov Chain): A finite Markov chain is a Bayesian
network with the DAG $X_1 \to X_2 \to \cdots \to X_L$, denoted C, and
joint distribution:

$$P(X_1, \ldots, X_L) = P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_2) \cdots P(X_L \mid X_{L-1}). \tag{89}$$

If X ~ MC(Π, T) is homogeneous, then the LPDs are $\theta_1 = P(X_1) = \Pi$
and $\theta_{n+1} = P(X_{n+1} \mid X_n) = T$ for all $n \in [L - 1]$ such
that MC(Π, T) = BN(C, θ). Similarly, HMMs are Bayesian networks
with hidden variables Z and factorized joint distribution
given in Eq. 79.


The meaning of the parameters θ of a Bayesian network
depends on the family of distributions that has been chosen for
the LPDs. In the general case of a discrete random variable with
finite state space, $\theta_n$ is a conditional probability table. If each vertex
$X_n$ has K possible states, then:

$$\theta_n = \left( P(X_n = k \mid X_{\mathrm{pa}(n)} = k') \right)_{k \in [K],\ k' \in [K]^{\mathrm{pa}(n)}} \tag{90}$$

has $K^{|\mathrm{pa}(n)|} \times (K - 1)$ free parameters. If $X_n$ depends on all other
variables, then $\theta_n$ has the maximal number of $K^{L-1}(K - 1)$ parameters,
which is exponential in the number of vertices. If, on the other
hand, $X_n$ is independent of all other variables, pa(n) = ∅, then $\theta_n$
has (K − 1) parameters, which is independent of L. For the chain
(Example 27), where each vertex except the first has exactly one
incoming edge, we find a total of (K − 1) + (L − 1)K(K − 1) free
parameters, which is of order $O(LK^2)$.
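The parameter counts above can be reproduced with a few lines of code. The sketch uses only the Python standard library; the function name, the dictionary encoding of parent sets, and the sizes L = 6 and K = 4 are illustrative assumptions.

```python
def free_parameters(parents, K):
    """Free parameters of a discrete Bayesian network.

    parents maps each vertex to the list of its parents; every vertex is
    assumed to have K states, so vertex n contributes K**|pa(n)| * (K - 1).
    """
    return sum(K ** len(pa) * (K - 1) for pa in parents.values())

L, K = 6, 4   # illustrative sizes
chain = {n: ([n - 1] if n > 1 else []) for n in range(1, L + 1)}
isolated = {n: [] for n in range(1, L + 1)}

n_chain = free_parameters(chain, K)              # (K-1) + (L-1) K (K-1)
n_isolated = free_parameters(isolated, K)        # L (K-1)
n_max_single = free_parameters({L: list(range(1, L))}, K)   # K^(L-1) (K-1)
```

For these sizes the chain needs 63 parameters, whereas a single vertex with all other vertices as parents already needs 3072.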
A popular model for continuous random variables $X_n$ is the
linear Gaussian model. Here, the LPDs are Gaussian distributions
with mean a linear function of the parents:

$$P(X_n \mid X_{\mathrm{pa}(n)}) = \mathrm{Norm}\left(\nu_n + w_n^{t} \cdot X_{\mathrm{pa}(n)},\ \sigma_n^2\right), \tag{91}$$

with parameters $\nu_n \in \mathbb{R}$ and $w_n \in \mathbb{R}^{|\mathrm{pa}(n)|}$ specifying the mean and
variance $\sigma_n^2$. The number of parameters increases linearly with the
number of parents, but only linear relationships can be modeled.
All marginal and conditional probabilities of $(X_1, \ldots, X_L)$ are also
Gaussians.

Learning a Bayesian network BN(G, θ) from data D can be done
in different ways following either the Bayesian or the maximum
likelihood approach as introduced in Subheading 2. In general, it
involves first finding the optimal network structure:

$$G^* = \operatorname*{argmax}_{G} P(G \mid D), \tag{92}$$

and then estimating the parameters:

$$\theta^* = \operatorname*{argmax}_{\theta} P(\theta \mid G^*, D) \tag{93}$$

for the given optimal structure G*. The first step is a model selection
problem as introduced in Subheading 2.

Model selection for Bayesian networks is a particularly hard
problem because the number of DAGs increases super-exponentially
with the number of vertices, rendering exhaustive
searches impractical, and the objective function in Eq. 92 is difficult
to compute. Recall that the posterior P(G | D) is proportional to
the product P(D | G)P(G) of marginal likelihood and network
prior, and the marginal likelihood:

$$P(D \mid G) = \int P(D \mid \theta, G)\, P(\theta \mid G)\, d\theta \tag{94}$$

is usually analytically intractable. Here, P(θ | G) is the prior distribution
of parameters given the network topology.

To address this limitation, the marginal likelihood (Eq. 94) can
be approximated by a function that is easier to evaluate. A popular
choice is the Bayesian information criterion (BIC) [26]:

$$\log P(D \mid G) \approx \log P(D \mid \hat{\theta}_{\mathrm{ML}}, G) - \frac{1}{2} \nu \log N, \tag{95}$$

where ν is the number of free parameters of the model and N the
size of the data. The BIC approximation can be derived under
certain assumptions, including a unimodal likelihood. It replaces
computation of the integral (Eq. 94) by evaluating the integrand at
the MLE and adding the correction term −(ν log N)/2, which
penalizes models of high complexity.
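The trade-off between fit and complexity in the BIC can be made concrete with a small calculation. The log-likelihoods, parameter counts, and sample size below are invented for illustration; only the formula itself is taken from the text.

```python
import math

def bic_score(loglik_at_mle, n_free_params, n_obs):
    """BIC approximation of log P(D | G): fit minus complexity penalty."""
    return loglik_at_mle - 0.5 * n_free_params * math.log(n_obs)

# Two hypothetical network structures fitted to the same N = 500 observations.
sparse = bic_score(loglik_at_mle=-1200.0, n_free_params=10, n_obs=500)
dense = bic_score(loglik_at_mle=-1195.0, n_free_params=30, n_obs=500)
preferred = "sparse" if sparse > dense else "dense"
```

Here the denser structure fits slightly better, but its 20 additional parameters cost more than the gain in log-likelihood, so the sparser structure receives the higher score.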

The model selection problem remains hard even with a tractable
scoring function, such as BIC, because of the enormous search
space. Local search methods, such as greedy hill climbing or
simulated annealing, are often used in practice. They return a
local maximum as a point estimate for the best network structure.
Results can be improved by running several local searches from
different starting topologies.

Often, data are sparse and we will find diffuse posterior distributions
of network structures, which might not be represented very
well by a single point estimate. In the fully Bayesian approach, we
aim at estimating the full posterior $P(G \mid D) \propto P(D \mid G) P(G)$. One
way to approximate this distribution is to draw a finite number of
samples from it. Markov chain Monte Carlo (MCMC) methods
generate such a sample by constructing a Markov chain that converges
to the target distribution [27].

In the Metropolis-Hastings algorithm [28], we start with a
random DAG $G^{(0)}$ and then iteratively generate a new DAG $G^{(n)}$
from the previous one $G^{(n-1)}$ by drawing it from a proposal distribution Q:

$$G^{(n)} \sim Q(G^{(n)} \mid G^{(n-1)}). \tag{96}$$

The new DAG is accepted with acceptance probability:

$$\min\left\{ \frac{P(D \mid G^{(n)})\, P(G^{(n)})\, Q(G^{(n-1)} \mid G^{(n)})}{P(D \mid G^{(n-1)})\, P(G^{(n-1)})\, Q(G^{(n)} \mid G^{(n-1)})},\ 1 \right\}. \tag{97}$$


Otherwise, the model is left unchanged and the next sample is
drawn. With this acceptance probability, it is guaranteed that the
Markov chain is ergodic and converges to the desired distribution.
After an initial burn-in phase, samples from the stationary phase of
the chain are collected, say $G^{(m)}, \ldots, G^{(N)}$. Any feature f of the
network (e.g., the presence of an edge or a subgraph) can be
estimated as the expected value:

$$E(f) \approx \frac{1}{N - m + 1} \sum_{n=m}^{N} f(G^{(n)}). \tag{98}$$


A critical point of the Metropolis-Hastings algorithm is the choice
of the proposal distribution Q, which encodes the way the network
space is explored. Because not all graphs, but only DAGs, are
allowed, computing the transition probabilities $Q(G^{(n)} \mid G^{(n-1)})$ is
usually the main computational bottleneck.
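The accept/reject rule and the sample average can be illustrated on a toy problem. The sketch below does not sample DAGs: it replaces the network space by five abstract states on a cycle, with unnormalized weights standing in for P(D | G)P(G) and a deliberately asymmetric proposal. It assumes NumPy; all numbers are made up.

```python
import numpy as np

def metropolis_hastings(weights, n_steps, rng, p_up=0.7):
    """Metropolis-Hastings on states 0..K-1 arranged on a cycle (K >= 3).

    weights are unnormalized target probabilities. The proposal moves one
    step up with probability p_up and one step down otherwise.
    """
    K = len(weights)
    state = 0
    samples = np.empty(n_steps, dtype=int)
    for n in range(n_steps):
        up = rng.random() < p_up
        proposal = (state + 1) % K if up else (state - 1) % K
        q_forward = p_up if up else 1 - p_up      # Q(proposal | state)
        q_reverse = 1 - p_up if up else p_up      # Q(state | proposal)
        ratio = (weights[proposal] * q_reverse) / (weights[state] * q_forward)
        if rng.random() < min(1.0, ratio):        # accept, else keep current state
            state = proposal
        samples[n] = state
    return samples

rng = np.random.default_rng(0)
weights = np.array([1.0, 2.0, 3.0, 4.0, 10.0])    # hypothetical unnormalized target
samples = metropolis_hastings(weights, n_steps=100000, rng=rng)
burn_in = 1000
kept = samples[burn_in:]
freq = np.bincount(kept, minlength=len(weights)) / len(kept)
target = weights / weights.sum()
```

The empirical frequencies after burn-in approximate the normalized target even though the sampler only ever uses ratios of unnormalized weights, which is what makes the approach attractive when the normalizing constant is intractable.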

Parameter estimation, i.e., solving (Eq. 93), can be done along
the lines described in Subheading 2 following either the ML or the
Bayesian approach. If the model contains hidden random variables,
then the EM algorithm (Subheading 3) can be used. However, this
approach is feasible only if efficient inference algorithms are available.
For hidden Markov models (Subheading 6), the forward and
backward algorithms provide an efficient way to compute marginal
probabilities and the expected hidden log-likelihood. These
algorithms can be generalized to the sum-product algorithm for
tree-like graphs and the junction tree algorithm for general DAGs.
The computational complexity of the junction tree algorithm is
exponential in the size of the largest clique of the so-called moralized
graph, which is obtained by dropping edge directions and
adding edges between any two vertices that have a common child in
the original DAG [11].

Alternatively, if exact inference is computationally too expensive,
then approximate inference can be used. For example, Gibbs
sampling [29] is an MCMC method for generating a sample from
the joint distribution $P(X_1, \ldots, X_L)$. The idea is to iteratively
sample from the conditional probabilities of $P(X_1, \ldots, X_L)$, starting
with $X_1^{(n+1)} \sim P(X_1 \mid X_2^{(n)}, \ldots, X_L^{(n)})$ and cycling through all variables
in turns:

$$X_k^{(n+1)} \sim P(X_k \mid X_1^{(n+1)}, \ldots, X_{k-1}^{(n+1)}, X_{k+1}^{(n)}, \ldots, X_L^{(n)}), \quad \text{for all } k = 2, \ldots, L. \tag{99}$$
Gibbs sampling can be regarded as a special case of the Metropolis-Hastings
algorithm. It is particularly useful if it is much easier to
sample from the conditionals $P(X_k \mid X_{\setminus k})$ than from the joint distribution
$P(X_1, \ldots, X_L)$, where $X_{\setminus k}$ denotes all variables $X_n$ except
$X_k$. For graphical models, the conditional probability of each vertex
$X_k$ depends only on its Markov blanket $X_{\mathrm{MB}(k)}$, defined as the set of
its parents, children, and co-parents (vertices with the same children),
$P(X_k \mid X_{\setminus k}) = P(X_k \mid X_{\mathrm{MB}(k)})$.
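A Gibbs sweep over discrete variables can be sketched when the joint distribution is available as a small table, so that each conditional is obtained by slicing and renormalizing. The example assumes NumPy; the 2 x 2 x 2 table is hypothetical, and in realistic models the conditionals would be computed from the Markov blanket instead of a full table.

```python
import numpy as np

def gibbs_sample(joint, n_sweeps, rng):
    """Gibbs sampler for discrete variables with a strictly positive joint table.

    joint[x_1, ..., x_L] holds P(X_1 = x_1, ..., X_L = x_L).
    """
    L = joint.ndim
    x = [0] * L                                   # arbitrary starting state
    samples = np.empty((n_sweeps, L), dtype=int)
    for s in range(n_sweeps):
        for k in range(L):
            # Slice the table along variable k with all other variables fixed.
            idx = tuple(x[:k]) + (slice(None),) + tuple(x[k + 1:])
            cond = joint[idx]
            cond = cond / cond.sum()              # P(X_k | all other variables)
            x[k] = int(rng.choice(len(cond), p=cond))
        samples[s] = x
    return samples

# Hypothetical joint distribution of three binary variables (entries sum to one).
joint = np.array([[[0.10, 0.05], [0.05, 0.20]],
                  [[0.05, 0.10], [0.15, 0.30]]])
rng = np.random.default_rng(0)
samples = gibbs_sample(joint, n_sweeps=10000, rng=rng)
marginal_x1 = samples[:, 0].mean()                # estimate of P(X_1 = 1)
marginal_x3 = samples[:, 2].mean()                # estimate of P(X_3 = 1)
```

The true marginals for this table are P(X_1 = 1) = 0.6 and P(X_3 = 1) = 0.65, which the sample averages should approach as the number of sweeps grows.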


Example 28 (Phylogenetic Tree Models): A phylogenetic tree model
[30] for a set of aligned DNA sequences from different species is a
Bayesian network model, where the graph is a tree in which the
leaves represent the observed contemporary species and the interior
vertices correspond to common extinct ancestors (Fig. 12). The
topology (graph structure) S defines the branching order and the
branch lengths correspond to (phylogenetic) time. The LPDs are
defined by a nucleotide substitution model (Subheading 5).

Let $X^{(i)} \in \{A, C, G, T\}^L$ denote the i-th column of a multiple
sequence alignment of L observed species. We regard the alignment
columns as independent observations of the evolutionary process.
The character states of the hidden (extinct) ancestors are denoted
$Z^{(i)}$. The likelihood of the observed sequence data $X = (X^{(1)}, \ldots, X^{(N)})$
given the tree topology S and the branch lengths t is

$$P(X \mid S, t) = \prod_{i=1}^{N} \sum_{Z^{(i)}} P(X^{(i)}, Z^{(i)} \mid S, t), \tag{100}$$

where P(X, Z | S, t) factors into conditional probabilities
according to the tree structure. This marginal probability can be
computed efficiently with an instance of the sum-product algorithm
known as the peeling algorithm (or Felsenstein algorithm) [31].

Fig. 12 Phylogenetic tree model. The observed random variables $X_i$ represent
contemporary species and the hidden random variables $Z_i$ their unknown
common ancestors


For example, in the tree displayed in Fig. 12, each observation
X has probability:

$$P(X) = \sum_{Z} P(X, Z) \tag{101}$$

$$= \sum_{Z} P(X_1 \mid Z_4) P(X_2 \mid Z_1) P(X_3 \mid Z_1) P(X_4 \mid Z_2) P(X_5 \mid Z_2) P(Z_1 \mid Z_3) P(Z_2 \mid Z_3) P(Z_3 \mid Z_4) P(Z_4) \tag{102}$$

$$= \sum_{Z_4} P(Z_4) P(X_1 \mid Z_4) \Big[ \sum_{Z_3} P(Z_3 \mid Z_4) \Big[ \sum_{Z_2} P(Z_2 \mid Z_3) P(X_4 \mid Z_2) P(X_5 \mid Z_2) \Big] \Big[ \sum_{Z_1} P(Z_1 \mid Z_3) P(X_2 \mid Z_1) P(X_3 \mid Z_1) \Big] \Big], \tag{103}$$

where we have omitted the dependency on the branch length t.
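The nested sums of Eq. 103 translate directly into a few vector operations. The sketch below assumes NumPy, the tree shape implied by Eq. 102 (root $Z_4$ with children $X_1$ and $Z_3$; $Z_3$ with children $Z_1$ and $Z_2$), one Jukes-Cantor transition matrix shared by all branches, and a uniform root distribution; the last two are simplifying assumptions of this example, not of the text.

```python
import itertools
import numpy as np

def jc_matrix(alpha_t):
    """Jukes-Cantor transition matrix for the product alpha * t."""
    e = np.exp(-4 * alpha_t)
    return 0.25 * ((1 - e) * np.ones((4, 4)) + 4 * e * np.eye(4))

def peeling(x, P, root):
    """Nested-sum form (Eq. 103): sum out Z1 and Z2, then Z3, then the root Z4.

    x = (x1, ..., x5) are leaf states coded 0..3 and
    P[a, b] = P(child = b | parent = a).
    """
    x1, x2, x3, x4, x5 = x
    m1 = P[:, x2] * P[:, x3]            # leaf evidence as a function of Z1
    m2 = P[:, x4] * P[:, x5]            # leaf evidence as a function of Z2
    m3 = (P @ m1) * (P @ m2)            # Z1 and Z2 summed out: function of Z3
    m4 = P[:, x1] * (P @ m3)            # Z3 summed out: function of Z4
    return float(root @ m4)

def brute_force(x, P, root):
    """Flat form (Eq. 102): explicit sum over all 4^4 ancestral configurations."""
    x1, x2, x3, x4, x5 = x
    total = 0.0
    for z1, z2, z3, z4 in itertools.product(range(4), repeat=4):
        total += (root[z4] * P[z4, x1] * P[z4, z3] * P[z3, z1] * P[z3, z2]
                  * P[z1, x2] * P[z1, x3] * P[z2, x4] * P[z2, x5])
    return total

P = jc_matrix(0.2)                      # hypothetical branch parameter
root = np.full(4, 0.25)                 # assumed uniform distribution of Z4
x = (0, 1, 1, 2, 2)                     # one alignment column, e.g. A, C, C, G, G
p_peel = peeling(x, P, root)
p_brute = brute_force(x, P, root)
```

Both functions return the same value, but the peeling version needs only a handful of 4 x 4 matrix-vector products instead of 256 terms, and the gap widens rapidly for larger trees.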


Several software packages implement ML or Bayesian learning of 
phylogenetic tree models. 


In the simplest case, we suppose that the observed alignment
columns are independent. However, it is more realistic to assume
that nucleotide substitution rates vary across sites because of varying
selective pressures. For example, there could be differences
between coding and noncoding regions, among different regions
of a protein (loops and catalytic sites), or among the three bases
of a triplet coding for an amino acid. More sophisticated models
can account for this rate heterogeneity. Let us assume site-specific
substitution rates $r_i$ such that the local probabilities become
$P(X^{(i)} \mid r_i, t, S)$. To model the distribution of the rates, often a
gamma distribution is used.


Example 29 (Gamma Distribution): The gamma distribution
Gamma($\alpha$, $\beta$) is parametrized by a shape parameter $\alpha$ and a rate
parameter $\beta$. It is defined by the density function:

$$f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x} \quad \text{for } x > 0. \tag{104}$$

Its expectation is $E(X) = \alpha/\beta$ and its variance $\mathrm{Var}(X) = \alpha/\beta^2$. The
gamma distribution generalizes several other distributions, for
example Gamma(1, $\lambda$) = Exp($\lambda$) (Example 22).
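
As a quick numerical check of the density in Eq. 104 and of the stated moments, the following Python sketch integrates the density with a simple midpoint rule. The parameter values, the integration bounds, and the function names are illustrative assumptions; setting the shape equal to the rate is one common way of giving the site rates a mean of 1.

```python
import math

def gamma_pdf(x, alpha, beta):
    """Density of Gamma(alpha, beta), shape alpha and rate beta, for x > 0."""
    return beta ** alpha / math.gamma(alpha) * x ** (alpha - 1.0) * math.exp(-beta * x)

def numeric_moments(alpha, beta, upper=40.0, steps=200_000):
    """Total mass, mean, and variance by midpoint-rule integration on (0, upper)."""
    h = upper / steps
    mass = mean = second = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        w = gamma_pdf(x, alpha, beta) * h
        mass += w
        mean += x * w
        second += x * x * w
    return mass, mean, second - mean * mean

alpha = beta = 2.0  # hypothetical values; alpha = beta gives a mean rate of 1
mass, mean, variance = numeric_moments(alpha, beta)
print(mass, mean, variance)  # compare with 1, alpha / beta, alpha / beta**2

# Special case Gamma(1, lam) = Exp(lam): both densities coincide.
lam, x0 = 3.0, 0.7
print(gamma_pdf(x0, 1.0, lam), lam * math.exp(-lam * x0))
```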


Another approach to account for varying mutation rates is the
phylogenetic hidden Markov model (phylo-HMM).


Example 30 (Phylo-HMM): Phylo-HMMs [32] combine HMMs 
and phylogenetic trees into a single Bayesian network model. The 
idea is to use an HMM along the linear chain of the genomic 
sequence and, at each position, to condition a phylogenetic tree 
model on the hidden state (Fig. 13). This architecture allows for 
modeling different evolutionary histories at different sites of the 
genome. In particular, the model can account for heterogeneity in 
the rate of evolution, for example, due to functionally conserved 
elements, but it also allows for a change in tree topology along the 
sequence, a situation that can result from recombination [23]. 
Phylo-HMMs are also used for gene finding. 


Fig. 13 Phylo-HMM. Shown are the first four positions of a Phylo-HMM. The hidden Markov chain has random
variables Z. In the trees, Y denotes the hidden common ancestors and X the observed species. Note that the
tree topology changes between positions 2 and 3


Exercise 31 (Inference in Bayesian Networks): Consider the gene
network on five genes denoted A, B, C, D, E, with the graph
structure displayed below. Gene expression profiles under different
conditions C1-C9 have been observed and are summarized in the
table below, where a zero indicates that the gene is not expressed
and a one that it is expressed.

      A  B  C  D  E
C1 |  0  0  0  0  0
C2 |  0  0  0  0  1
C3 |  0  0  0  0  1
C4 |  1  1  1  1  0
C5 |  1  0  1  1  0
C6 |  0  0  0  1  1
C7 |  1  1  1  1  0
C8 |  1  0  0  0  1
C9 |  1  0  0  1  1

(a) Specify the adjacency matrix of the directed graph.

(b) Determine the local probability distributions for each vertex of
the graph. Use conditional counting to determine the conditional
probabilities as:

$$P(X_i \mid X_{\mathrm{pa}(i)}) = \frac{N(X_i, X_{\mathrm{pa}(i)})}{\sum_{X_i} N(X_i, X_{\mathrm{pa}(i)})}$$

where $N(X_i, X_{\mathrm{pa}(i)})$ is the number of joint observations of $X_i$
and its parents.

(c) What is the joint probability of $(X_A, X_B, X_C, X_D, X_E)$ for this
network?
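
Conditional counting can be sketched in a few lines of Python. The example below only illustrates the counting formula; it is not a solution to the exercise. The expression profiles follow the table as far as it can be read, and the single edge A -> B used in the example is a hypothetical parent set, since the graph structure is not reproduced here. All names are illustrative.

```python
from collections import Counter

GENES = ["A", "B", "C", "D", "E"]
# Rows C1..C9 of the expression table, columns A..E.
PROFILES = [
    (0, 0, 0, 0, 0), (0, 0, 0, 0, 1), (0, 0, 0, 0, 1),
    (1, 1, 1, 1, 0), (1, 0, 1, 1, 0), (0, 0, 0, 1, 1),
    (1, 1, 1, 1, 0), (1, 0, 0, 0, 1), (1, 0, 0, 1, 1),
]

def local_distribution(child, parents):
    """Estimate P(child | parents) as N(child, parents) / N(parents)."""
    ci = GENES.index(child)
    pis = [GENES.index(p) for p in parents]
    joint = Counter()
    parent_counts = Counter()
    for row in PROFILES:
        pa = tuple(row[k] for k in pis)
        joint[(row[ci], pa)] += 1
        parent_counts[pa] += 1
    # Only observed (child, parents) combinations appear in the result.
    return {key: n / parent_counts[key[1]] for key, n in joint.items()}

# Hypothetical edge A -> B, used only to show the mechanics.
lpd_b = local_distribution("B", ["A"])
print(lpd_b)
```

Keys are pairs (child value, tuple of parent values); combinations that were never observed are simply absent, which in practice is often handled with pseudocounts.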
Abstract


In this chapter, we give a not-so-long and self-contained introduction to computational molecular evolu- 
tion. In particular, we present the emergence of the use of likelihood-based methods, review the standard 
DNA substitution models, and introduce how model choice operates. We also present recent developments 
in inferring absolute divergence times and rates on a phylogeny, before showing how state-of-the-art 
models take inspiration from diffusion theory to link population genetics, which traditionally focuses on a
taxonomic level below that of the species, and molecular evolution. Although this is not a cookbook
chapter, we try to point to popular programs and implementations along the way.


Key words Likelihood, Bayes, Model choice, Phylogenetics, Divergence times 


1 Introduction 


Many books [1-7] and review papers [8—10] have been published 
in recent years on the topic of computational molecular evolution, 
so that updating our previous primer on the very same topic [11] 
may seem redundant. However, the field is continuously under- 
going changes, as both models and algorithms become even more 
sophisticated, efficient, robust, and accurate. This increase in refine- 
ment has not been motivated by a desire to complicate existing 
models, but rather to make an old wish come true: that of having 
integrated methods that can take unaligned sequences as an input, 
and simultaneously output the alignment, the tree, and other esti- 
mates of interest, in a sound statistical framework justified by sound 
principles: those of population genetics. 

The aim of this chapter is still to provide readers with the 
essentials of computational molecular evolution, offering a brief 
overview of recent progress, both in terms of modeling and algo- 
rithm development. Some of the details will be left out as they are 
dealt with by others in this volume. Likewise, the analysis of
genomic-scale data is briefly touched upon, but the details are left
to other chapters.


2 Parsimony and Likelihood 


2.1 A Brief Overview of Parsimony


The simplest phylogenetic question pertains to the reconstruction 
of a rooted tree with three sequences (Fig. 1). The sequences can be 
made of DNA, RNA, amino acids, or codons, but for the sake of 
simplicity we focus on DNA throughout this chapter. In the basic 
example below, based on [12], DNA sequences are assumed to have 
been sampled from three different species that diverged a “long 
time ago.” In this context, we assume that the data or gene 
sequences have been aligned (see Subheading 6), and that the 
DNA alignment is: 


s1 ATGACCCCAATACGCAAAACTAACCCCCTAATAAAATTAATTAACCACTCCTTC
s2 ATGACCCCAATACGGAAAACTAACCCCCAAATAAAATTAATTAACCACTCATTC
s3 ATGACGCCAATACGCAAAACTAACCGCCTAATAAAATTAATTTACCACTCATTC


The objective is to estimate which of the three fully resolved 
topologies in Fig. 1 is supported by the data. In order to go further,
we recode the data in terms of site patterns, which correspond to 
the patterns observed in each column of our alignment. This recod- 
ing implies that columns, or sites, in our alignment evolve accord- 
ing to an identically and independently distributed (iid) process. 
With this in mind, our alignment can be recoded in the following 
manner. When all the characters (nucleotides) in a column are 
identical, the same letter is assigned to each character, for example, 
x, irrespective of the actual character state. When a substitution 
occurs in one of the three sequences, we have three corresponding 
site patterns: xxy, xyx, and yxx, where the order within each site 
pattern respects the order of the sequences in the alignment, s1, s2, s3.


s1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyxxx
s2 xxxxxxxxxxxxxxyxxxxxxxxxxxxxyxxxxxxxxxxxxxxxxxxxxxxxxx
s3 xxxxxyxxxxxxxxxxxxxxxxxxxyxxxxxxxxxxxxxxxxyxxxxxxxxxxx


Fig. 1 The simplest phylogenetic problem. With three species, S1, S2, and S3, four
rooted trees are possible: T0, the star tree, and the three resolved topologies
T1-T3


Table 1
The winning-site strategy

Site pattern    Supported Ti    Count
xxx             T0              48
xxy             T1              3
xyx             T2              2
yxx             T3              1

The data alignment is reduced to a frequency table of site patterns. In the case of three
sequences, only the last three site patterns are informative


The first informative site pattern, xxy, implies that at this
particular site, sequences s1 and s2 are more similar to each other
than either is to s3, so that this site pattern supports topology T1, which
groups sequences s1 and s2 together (Fig. 1). The most intuitive
idea, called the winning-site strategy, is that the topology supported
by the data corresponds to the fully resolved topology that has the
largest number of site patterns in its favor. In the example shown
above, topology T1 is supported by three columns (with site pattern
xxy), topology T2 by two columns (xyx), and T3 by one column
(yxx; see Table 1). This is the intuition behind parsimony, which
minimizes the amount of change along a topology. Strictly
speaking, unordered parsimony cannot distinguish these three
trees as they all require at least one single change. Yet, it can be
argued that if tree T1 is the true tree, site pattern xxy is more likely
than any other pattern as xxy requires at least one change along a
long branch (the one leading to sequence s3) while both xyx and
yxx require a change along a short branch (see p. 28 sqq. in [13]).
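
The recoding into site patterns and the winning-site count are easy to reproduce. The Python sketch below recodes the three-sequence alignment given above and tallies the patterns; the function and variable names are illustrative, and the "xyz" label for columns with three different states is an assumption added for completeness (no such column occurs in this alignment).

```python
from collections import Counter

ALIGNMENT = [
    "ATGACCCCAATACGCAAAACTAACCCCCTAATAAAATTAATTAACCACTCCTTC",  # s1
    "ATGACCCCAATACGGAAAACTAACCCCCAAATAAAATTAATTAACCACTCATTC",  # s2
    "ATGACGCCAATACGCAAAACTAACCGCCTAATAAAATTAATTTACCACTCATTC",  # s3
]

def site_pattern(column):
    """Recode one alignment column (s1, s2, s3) as a site pattern."""
    a, b, c = column
    if a == b == c:
        return "xxx"
    if a == b:
        return "xxy"  # s3 differs: supports T1
    if a == c:
        return "xyx"  # s2 differs: supports T2
    if b == c:
        return "yxx"  # s1 differs: supports T3
    return "xyz"      # three different states (assumed label)

counts = Counter(site_pattern(col) for col in zip(*ALIGNMENT))
SUPPORT = {"xxy": "T1", "xyx": "T2", "yxx": "T3"}
votes = {SUPPORT[p]: n for p, n in counts.items() if p in SUPPORT}
winner = max(votes, key=votes.get)
print(dict(counts), winner)
```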

A number of methodological variations exist. A very condensed 
overview can be found in the books by Durbin [14] or, with more 
details, Felsenstein [15]. Most computer programs that implement 
substitution models where sites are iid condense the alignment as 
an array of site patterns; some, like PAML [16], even output these 
site patterns. 

Note that in obtaining this topology estimate, most of the site
columns were discarded from our alignment (all the xxx site patterns,
representing 89% of the sites in our example above). Most of
our data were phylogenetically uninformative (for parsimony). We
also failed to take evolutionary time into account, or any process of
basic molecular biology, such as the observation that transitions
(substitution of a purine [A or G] by a purine, or a pyrimidine by a
pyrimidine) are more frequent than transversions (substitution
between a purine and a pyrimidine).


2.2 Assessing the Reliability of an Estimate: The Bootstrap


As with any statistical exercise estimating a quantity of interest, we 
would like to have a confidence interval, taken at a particular level, 
so that we can gauge the reliability of our estimate. A standard 
approach to derive confidence intervals is the bootstrap [17], a 
computational technique that resamples data points with replace- 
ment to simulate the distribution of any test statistic under the null 
hypothesis that is tested. The bootstrap, particularly useful in com- 
plicated nonparametric problems where no asymptotic results can 
be obtained [18], was adapted by Felsenstein to the nonstandard 
phylogenetic problem [19]. Indeed, the problem is nonstandard in 
that the object for which we wish to assess accuracy is not a real- 
valued parameter, but a graph. 

The basic idea, clearly explained in [20], consists in resampling 
columns of the alignment, with replacement, to construct a “syn- 
thetic” alignment of the same size as the original alignment. This 
synthetic or bootstrap replicate is then subjected to the same tree- 
reconstruction algorithm used on the original data (Fig. 2). This 
exercise is repeated a large number of times (e.g., x 10°), and the 
proportion of each original bipartition (internal node) in the set of 
bootstrapped trees is recorded. In Fig. 2, for instance, the bipartition
s1s2|s3 is found in two bootstrap trees out of three, so the
bootstrap support for this node is 66.7%. In this simple case with
three sequences, the bootstrap support for topology T1 is also
66.7%. This bootstrap proportion for topologies (or for trees
when branch lengths are taken into account, in a maximum likeli- 
hood context, for instance—see below) can be computed very 
quickly by bootstrapping the sitewise log-likelihood values, instead 
of the columns of the alignment; this bootstrap is called RELL, for 
“resampling estimated log-likelihood” [21]. 
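
A minimal Python sketch of the nonparametric bootstrap on the three-sequence example is given below. Columns are resampled with replacement and each replicate is analyzed with the winning-site strategy. The number of replicates, the random seed, the function names, and the rule that ties are reported as the unresolved star tree T0 are all assumptions made for illustration.

```python
import random
from collections import Counter

ALIGNMENT = [
    "ATGACCCCAATACGCAAAACTAACCCCCTAATAAAATTAATTAACCACTCCTTC",  # s1
    "ATGACCCCAATACGGAAAACTAACCCCCAAATAAAATTAATTAACCACTCATTC",  # s2
    "ATGACGCCAATACGCAAAACTAACCGCCTAATAAAATTAATTTACCACTCATTC",  # s3
]

def winning_topology(columns):
    """Winning-site strategy; ties (or no informative site) give T0 (assumed rule)."""
    votes = Counter()
    for a, b, c in columns:
        if a == b != c:
            votes["T1"] += 1
        elif a == c != b:
            votes["T2"] += 1
        elif b == c != a:
            votes["T3"] += 1
    ranked = votes.most_common()
    if not ranked:
        return "T0"
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "T0"
    return ranked[0][0]

def bootstrap_support(alignment, replicates=1000, seed=1):
    """Proportion of bootstrap replicates supporting each topology."""
    rng = random.Random(seed)
    columns = list(zip(*alignment))
    tally = Counter()
    for _ in range(replicates):
        # Resample columns with replacement, keeping the alignment length.
        resampled = rng.choices(columns, k=len(columns))
        tally[winning_topology(resampled)] += 1
    return {topology: n / replicates for topology, n in tally.items()}

original = winning_topology(list(zip(*ALIGNMENT)))
support = bootstrap_support(ALIGNMENT)
print(original, support)
```

With only six informative columns, the support values are expected to be modest, which illustrates how little signal this toy alignment carries.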

However, this approach has not been used or cited extensively
since 2008 (source: ISI Thomson). One alternative that has
gained momentum is the one based on the approximate likelihood
ratio test (aLRT) [22], implemented, for instance, in PhyML
[23, 24]. Instead of resampling any quantity (sites or sitewise
log-likelihood values), the aLRT tests the null hypothesis that an
interior branch length is zero. In spite of being slightly conservative
in simulations, the approach is extremely fast and hence highly
practical [22].

The meaning of the bootstrap has been a matter of debate for 
years. As noted before [8] (see also [22]), the bootstrap proportion
P can be seen as assessing the correctness of an internal node, and 


Original sequence alignment 


000000000111111111122222222223333333333444444444455555 
123456789012345678901234567890123456789012345678901234 
ATGACCCCAATACGCAAAACTAACCCCCTAATAAAATTAATTAACCACTCCTTC 
ATGACCCCAATACGGAAAACTAACCCCCAAATAAAATTAATTAACCACTCATTC 
ATGACGCCAATACGCAAAACTAACCGCCTAATAAAATTAATTTACCACTCATTC 


Bootstrap replicate #1 


043053000522400123244401023400123244440012324440144321 
825507119163149560338088219149560338014956033806238973 
CTACCTAAACCATAACCAAACACATTATAACCAAACATAACCAAACACAACACC 
CTACCTAAAACATAAGCAAACACATTATAAGCAAACATAAGCAAACACAACACC 
CTACCTAAAAGATAACGAATCACATTATAACGAATCATAACGAATCAGATCACC 


Bootstrap replicate #2 


101232414430531044010200102324143441001230240123201231 
595603350255075180882134566033505946455604719560395605 
CACCAAACATACCTCACACATTGACCCAAACAATAAACCCAACTACCAAACCAC 
GAGCAAAGATACCTGACACATTGAGCCAAAGAATAAACGCAACTAGCAAAGCAG 
CACGAATCATACCTCACACATTGACGGAATCAATAAACCGAACTACGAAACGAC 


Bootstrap replicate #3 


244401443212401232043051232444444321211111130202324004 
338062389737195603825505603380238973712345983923570921 
AACACAACACCCTACCAACTACCCCAAACATACACCCTACGCATGTTAACAATT 
AACACAACACCCTAGCAACTACCGCAAACATACACCCTACGGATGATAACAATT 
ATCAGATCACCCTACGAACTACCCGAATCATTCACCCTACGCATGTTAACAATT 
Fig. 2 The (nonparametric) bootstrap. See text for details


failing to do so [25], or $1 - P$ can be interpreted as a conservative
probability of falsely supporting monophyly [26]. Since bootstrap 
proportions are either too liberal or too conservative depending on 
the actual interpretation of P [27], it is difficult to adjust the 
threshold below which monophyly can be confidently ruled out 
[28]. Alternatively, an intuitive geometric argument was proposed 
to explain the conservativeness of bootstrap probabilities [18] and 
was further developed into the approximately unbiased or AU test, 
implemented in CONSEL [29]. In spite of these difficulties, the 
bootstrap is still widely used—and mandatory in all publications 
featuring a phylogeny—to assess the confidence one can have in the 
tree estimated from the data under a particular scheme or model 
(see Subheading 2.9.3 below). Lastly, note that bootstrap support 
has often been abused [30], as a high value does not necessarily
indicate high phylogenetic signal, and can be the result of systematic
biases [31] due to the use of the wrong model of evolution, for
instance, as detailed below.


2.3 Parsimony and LBA


Now that we have a means of evaluating the support for the 
different topologies, we can test some of the conditions under 
which parsimony estimates the correct tree topology. Ideally, a 
good method should return the correct answer with a probability 
of one when the number of sites increases to infinity. This desirable 
statistical property is called consistency. One serious criticism of 
parsimony is its sensitivity to long branch attraction, or LBA, even 
in the presence of an infinite amount of data (infinite alignment 
length) [31]. In other words, parsimony is not statistically 
consistent. 

Different types of model misspecification can lead to LBA, and 
new ones are continually identified. The topology originally used to 
demonstrate the artifact is represented in Fig. 3, where two long 
branches are separated by a shorter one. Felsenstein demonstrated 
that, under a simple evolutionary process, the artifact or LBA tree is 
reconstructed. Note that parsimony is not the only phylogenetic 
method affected by LBA, but because it posits a very simple model 
of evolution [32-34], parsimony is particularly sensitive to the 
artifact. In spite of this, one particular journal chose to enforce 
the use of parsimony, stating that authors should estimate their 
phylogenies by parsimony but also that, if estimated by some 
other method, they would need to defend their position “on phil- 
osophical grounds” [35]; there is of course no valid scientific 
justification for taking such a step—derided in the “Twittersphere” 
as “#parsimonygate.” 

The LBA artifact has been shown to plague the analysis of 
numerous data sets, and a number of empirical approaches have 
been used to detect the artifact [36, 37]. Most recent papers based 
on multigene analyses (e.g., [38, 39]) now examine carefully the 
effect of across-site and across-lineage rate variation (in addition to 
the use of heterogeneous models). For both sites and lineages, the 


Fig. 3 The long branch attraction artifact. The true tree topology has two long branches separated by a short
one. The tree reconstructed under a simple model of evolution (a) is the artifact or LBA tree on the left. The tree
reconstructed under the correct model of evolution (b) is the correct tree, on the right


Fig. 4 Saturation of DNA sequences. As time increases, the observed number of differences between pairs of
sequences reaches a plateau, whereas the actual number of substitutions keeps increasing. Axes: geological
time (and actual number of substitutions), from present to past, against the observed number of substitutions.
The ideal distance is linear with time, whereas sites begin to saturate with multiple substitutions


procedure is the same and consists in successively removing either 
the sites that evolve the fastest or the taxa that show the longest 
root-to-tip branch lengths. 


2.4 Origin of the Problem

By definition, parsimony minimizes the number of changes along
each branch of the tree. When there is only a small number of
changes per branch, the method is expected to be accurate. However,
when sequences are quite divergent, the parsimony assumption
leads to underestimating the actual number of changes (Fig. 4;
see also [40]).

Consequently, we would like a tree-reconstruction method that 
accounts for multiple substitutions. We would also like a method 
that (1) takes into account less parsimonious as well as most parsi- 
monious state reconstructions (intervals, tests), (2) weights changes 
differently if they occur on branches of different length (evolution- 
ary time), and (3) weights different kinds of events (transitions, 
transversions) differently (biological realism). Likelihood methods 
include such considerations explicitly, as they require modeling the 
substitution process itself. 


2.5 Modeling Molecular Evolution

The basic model of DNA substitution (Fig. 5) is defined on the
DNA state space, made of the four nucleotides thymine (T), cytosine
(C), adenine (A), and guanine (G). Note that T and C are
pyrimidines (biochemically, six-membered rings), while A and
G are purines (fused five- and six-membered heterocyclic compounds).
Depending on these two biochemical categories, two
different types of substitutions can happen: transitions within a
category, and transversions between categories. Their respective
rates are denoted $\alpha$ and $\beta$ in Fig. 5.

The process we want to model should describe the substitution
process of the different nucleotides of a DNA sequence. Again, we
will make the simplifying assumption that sites evolve under a
time-homogeneous Markov process and are iid, as above. We can
therefore concentrate on one single site for now (e.g., [41]).


Fig. 5 Molecular evolution 101. Specification of the basic model of DNA
substitution. Transitions (rate $\alpha$) occur within the pyrimidines (Y: T, C) or within
the purines (R: A, G); transversions (rate $\beta$) occur between the two categories

At a particular site, we want to describe the change in nucleotide
frequency after a short amount of time dt, during which the
frequency of A, for instance, will change from
$f_A(t)$ to $f_A(t + dt)$. According to Fig. 5, $f_A(t + dt)$ will be equal to
what we had at time t, $f_A(t)$, minus the quantity of A that “disappeared”
by mutation during dt, plus the quantity of A that
“appeared” by mutation during dt. Denoting the total mutation rate away
from A as $\mu_A$, the quantity of A that “disappeared” by mutation during dt is
simply $f_A(t)\,\mu_A\,dt$. These mutations away from A generated quantities
of T, C, and G, in which we are not interested at the moment
since we only want to know what happens to A. There are three
different ways to generate A: from either T, C, or G (Fig. 5).
Coming from T, mutation will generate $f_T(t)\,\mu_{TA}\,dt$ of A during
dt. Similar expressions exist for C and for G, so that in total, over
the three non-A nucleotides, mutation will generate
$\sum_{i \neq A} f_i(t)\,\mu_{iA}\,dt$. Mathematically, we can express these ideas as:

$$f_A(t + dt) = f_A(t) - f_A(t)\,\mu_A\,dt + \sum_{i \neq A} f_i(t)\,\mu_{iA}\,dt \tag{1}$$


Equation 1 describes the change of frequency of A during a
short time interval dt. Similar equations can be written for T, C, and
G, so that we actually have a system of four equations describing the
change in nucleotide frequencies over a short time interval dt:

$$\begin{aligned}
f_T(t + dt) &= f_T(t) - f_T(t)\,\mu_T\,dt + \sum_{i \neq T} f_i(t)\,\mu_{iT}\,dt \\
f_C(t + dt) &= f_C(t) - f_C(t)\,\mu_C\,dt + \sum_{i \neq C} f_i(t)\,\mu_{iC}\,dt \\
f_A(t + dt) &= f_A(t) - f_A(t)\,\mu_A\,dt + \sum_{i \neq A} f_i(t)\,\mu_{iA}\,dt \\
f_G(t + dt) &= f_G(t) - f_G(t)\,\mu_G\,dt + \sum_{i \neq G} f_i(t)\,\mu_{iG}\,dt
\end{aligned} \tag{2}$$

which, in matrix notation, can simply be rewritten as:

$$F(t + dt) = F(t) + QF(t)\,dt \tag{3}$$


with an obvious notation for F, while the instantaneous rate matrix
Q is

$$Q = \begin{pmatrix}
-\mu_T & \mu_{TC} & \mu_{TA} & \mu_{TG} \\
\mu_{CT} & -\mu_C & \mu_{CA} & \mu_{CG} \\
\mu_{AT} & \mu_{AC} & -\mu_A & \mu_{AG} \\
\mu_{GT} & \mu_{GC} & \mu_{GA} & -\mu_G
\end{pmatrix} \tag{4}$$

In all the following matrices, we will use the same order for nucleotides:
T, C, A, and G, which follows the order in which codon
tables are usually written. Recall that $\mu_{ij}$ is the mutation rate from
nucleotide i to nucleotide j. Note also that the sum of each row is 0.
Let us rearrange the matrix notation from Eq. 3 as:

$$F(t + dt) - F(t) = QF(t)\,dt \tag{5}$$

and take the limit as $dt \to 0$:

$$\frac{dF(t)}{dt} = QF(t) \tag{6}$$

which is a first-order differential equation that can be integrated as:

$$F(t) = e^{Qt} F(0) \tag{7}$$


Very often, Eq. 7 is written as $F(t) = P(t)F(0)$, where
F(0) is conveniently taken to be the identity matrix and
$P(t) = \{P_{i,j}(t)\} = e^{Qt}$ is the matrix of probabilities of going from state i to
j during a finite time duration t. Note that the right-hand side of
this equation is a matrix exponential, which is not the same as
the exponential of all the elements (rows and columns) of that
matrix. The computation of the term $e^{Qt}$ demands that a spectral
decomposition of the matrix Q be realized. This means finding a
diagonal matrix D of eigenvalues and a matrix M of (right) eigenvectors
so that:

$$P(t) = M e^{Dt} M^{-1} \tag{8}$$

The exponential of the diagonal matrix D is simply the exponential
of the diagonal terms.

Except in the simplest models of evolution, finding analytical 
solutions for the eigenvalues and associated eigenvectors can be 
tedious. As a result, numerical procedures are employed to solve 
Eq. 8. Alternatively, a Taylor expansion can be used to approximate 
P(t). 
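
The Taylor-expansion route can be sketched in plain Python as follows. The example approximates $e^{Qt}$ for a Jukes-Cantor generator by a truncated series combined with scaling and squaring, and compares the result with the known closed-form Jukes-Cantor probabilities. The number of series terms, the number of squarings, the choice of generator, and the function names are assumptions for illustration; production software relies on more careful numerical routines.

```python
import math

def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_taylor(q, t, terms=20, squarings=10):
    """Approximate exp(Q t) by a truncated Taylor series with scaling and squaring."""
    n = len(q)
    scale = t / 2 ** squarings
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, a)]  # a**k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)  # undo the scaling
    return result

# Jukes-Cantor generator scaled to one expected substitution per unit time.
Q_JC = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
t = 0.5
P = expm_taylor(Q_JC, t)

# Closed-form Jukes-Cantor probabilities for comparison.
same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)
print(P[0][0], same, P[0][1], diff)

# The element-wise exponential is a different object altogether.
print(math.exp(Q_JC[0][0] * t))
```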

If all off-diagonal entries in Q are positive, any state or nucleotide can be
reached from any other in a finite number of steps (all states
“communicate”) and the base frequencies have a stationary distribution
$\pi = (\pi_T, \pi_C, \pi_A, \pi_G)$. This is the steady state reached after an
“infinite” amount of time, or long enough for the Markov process
to forget its initial state, starting from “random” base frequencies.


2.6 Computation on a Tree


Fig. 6 Likelihood computation on a small tree. See text for details


Now that we know how to determine the rate of change of nucleotide
frequencies during a time interval dt, we can compute the
probability of a particular nucleotide change on a tree. The simplest
case, though somewhat artificial with only two sequences, is
depicted in Fig. 6.

We are looking at a particular nucleotide position, denoted l,
for two aligned sequences. The observed nucleotides at this position
are T in sequence 1, and C in sequence 2. The branch separating
T from C has a total length of $t_0 + t_1$. For the sake of
convenience, we set an arbitrary root along this path. The likelihood
at site l is then given by the probability of going from the
fictive root i to T in $t_0$, and from i to C in $t_1$. Any of the four
nucleotides can be present at the fictive root. As we do not know
which one was there, we sum these probabilities over all possible
states, weighted by their prior probabilities, the equilibrium frequencies
$\pi_i$. In all, we have the likelihood $\ell_l$ at site l:

$$\ell_l = \sum_{i \in \{T, C, A, G\}} \pi_i\, P_{i,T}(t_0)\, P_{i,C}(t_1) \tag{9}$$


which is equivalent to the Chapman-Kolmogorov equation
[42]. As all the sites are assumed to be iid, the likelihood of an
alignment is the product of the site likelihoods in Eq. 9. Because all
these sitewise probabilities can be small, and because the product of
small numbers can become smaller than what a computer can
represent in memory (underflow), all computations are done on a
logarithmic scale and may include some form of rescaling [43].
Note that this example is somewhat artificial: with only
two sequences, we can compute the likelihood directly with
$\pi_T P_{T,C}(t_0 + t_1) = \pi_C P_{C,T}(t_0 + t_1)$; the full summation over unknown
states as in Eq. 9 is required with three sequences or more. When
analyzing a multiple-sequence alignment of S sequences, there will
be many nodes in the tree for which the character state is unknown,
which means that the summation required will involve many terms.
Specifically, with four possible states at each interior node, a rooted
binary tree with its S - 1 interior nodes leads to a sum over $4^{S-1}$ terms.
Fortunately, terms can be factored out of the summation, and a dynamic
programming algorithm with a complexity of the order of $O(4^2 S)$, called the
pruning algorithm [44], can be used (see [15] for details).
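
The two-sequence case of Eq. 9 and the need for a logarithmic scale can be illustrated with a short Python sketch. The Jukes-Cantor transition probabilities, the branch lengths, the artificial 2000-site alignment, and the function names are assumptions chosen for illustration.

```python
import math

NUCS = "TCAG"
PI = {n: 0.25 for n in NUCS}  # equilibrium frequencies under Jukes-Cantor

def p_jc69(i, j, t):
    """Jukes-Cantor probability of going from state i to state j in time t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def site_likelihood(x1, x2, t0, t1):
    """Eq. 9: sum over the unknown state i at the fictive root."""
    return sum(PI[i] * p_jc69(i, x1, t0) * p_jc69(i, x2, t1) for i in NUCS)

def log_likelihood(seq1, seq2, t0, t1):
    """Sum of sitewise log-likelihoods (sites assumed iid)."""
    return sum(math.log(site_likelihood(a, b, t0, t1)) for a, b in zip(seq1, seq2))

t0, t1 = 0.10, 0.25  # hypothetical branch lengths
via_root = site_likelihood("T", "C", t0, t1)
direct = PI["T"] * p_jc69("T", "C", t0 + t1)  # root position does not matter
print(via_root, direct)

# Underflow: the raw product over many sites collapses to 0.0,
# whereas the sum of logarithms remains finite.
seq1, seq2 = "T" * 2000, "C" * 2000
raw_product = math.prod(site_likelihood(a, b, t0, t1) for a, b in zip(seq1, seq2))
loglik = log_likelihood(seq1, seq2, t0, t1)
print(raw_product, loglik)
```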


2.7 Substitution Models and Instantaneous Rate Matrices Q

Now that we have almost all the elements to compute the likelihood
of a set of parameters, including the tree (i.e., the set of
branch lengths and the tree topology; see Subheading 2.10), the
only missing element required to compute the likelihood at each
site, as in Eq. 9, for instance, is the specification of the instantaneous
rate matrix Q as in Eq. 4. Remember that the $\mu_{ij}$ represent
mutation rates from state (nucleotide) i to j. This matrix is generally
rewritten as:

$$Q = \mu \begin{pmatrix}
- & r_{TC} & r_{TA} & r_{TG} \\
r_{CT} & - & r_{CA} & r_{CG} \\
r_{AT} & r_{AC} & - & r_{AG} \\
r_{GT} & r_{GC} & r_{GA} & -
\end{pmatrix} \tag{10}$$

so that each entry $r_{ij}$ is a rate of change from nucleotide i to
nucleotide j. The diagonal entries are left out, indicated by a “-,”
and are in fact calculated as the negative sum of the off-diagonal
entries (as rows sum to 0).

The simplest specification of Q would be that all rates of change
are identical, so that Q becomes (leaving out the mutation rate $\mu$
and indexing the matrix to indicate the difference):

$$Q_{JC} = \begin{pmatrix}
- & 1 & 1 & 1 \\
1 & - & 1 & 1 \\
1 & 1 & - & 1 \\
1 & 1 & 1 & -
\end{pmatrix} \tag{11}$$


which is the model proposed by Jukes and Cantor [45] and often 
noted “JC” or “JC69”. Under the specification of Eq. 11, this 
model has no free parameter. The process is generally scaled such 
that the unit of branch lengths can be interpreted as an expected 
number of substitutions per site. 

Of course, this model is extremely simplistic and neglects a fair 
amount of basic molecular biology. In particular, it overlooks two 
observations. First, base frequencies are not all equal in actual DNA 
sequences, but are rather skewed, and second, transitions are more 
frequent than transversions (see Subheading 2.5). 

The way to account for this first “biological realism” is as
follows. If DNA sequences were made exclusively of As, for
instance, that would mean that all mutations are towards the
observed base, in this case A, whose equilibrium or stationary
frequency is $\pi_A$. The same reasoning can be used for arbitrary
equilibrium frequencies $\pi$, so that all relative rates of change in
Q become proportional to the equilibrium frequency
of the target nucleotide. In other words, the instantaneous rate
matrix Q becomes:

$$Q_{F81} = \begin{pmatrix}
- & \pi_C & \pi_A & \pi_G \\
\pi_T & - & \pi_A & \pi_G \\
\pi_T & \pi_C & - & \pi_G \\
\pi_T & \pi_C & \pi_A & -
\end{pmatrix} \tag{12}$$

again with the requirement that rows sum to 0. This matrix represents
the Felsenstein or F81 model [44]. This model has four
parameters (the four base frequencies), but since base frequencies
sum to 1, we only have three free parameters.

The second “biological realism,” accounting for the different
rates of transversions and transitions, can be described by saying
that transitions occur $\kappa$ times faster than transversions. From Fig. 5,
recall that transitions are mutations from T to C (and vice versa) and
from A to G (and vice versa). This translates into:

$$Q_{K80} = \begin{pmatrix}
- & \kappa & 1 & 1 \\
\kappa & - & 1 & 1 \\
1 & 1 & - & \kappa \\
1 & 1 & \kappa & -
\end{pmatrix} \tag{13}$$

This model is called the Kimura two-parameter model or K80
(or K2P) [46]. The model is alternatively described with the two
rates $\alpha$ and $\beta$ (see Fig. 5). In the “$\kappa$ version” of the model as in
Eq. 13, there is only one free parameter.

Of course it is possible to account for both kinds of “biological
realism,” unequal equilibrium base frequencies and transition bias,
all in the same model, whose generator Q becomes:

$$Q_{HKY} = \begin{pmatrix}
- & \pi_C \kappa & \pi_A & \pi_G \\
\pi_T \kappa & - & \pi_A & \pi_G \\
\pi_T & \pi_C & - & \pi_G \kappa \\
\pi_T & \pi_C & \pi_A \kappa & -
\end{pmatrix} \tag{14}$$

which corresponds to the Hasegawa-Kishino-Yano or HKY
(or HKY85) model [47]. This model has four free parameters: $\kappa$
and three base frequencies.
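
The nesting of these models can be made concrete with a small Python sketch that builds the HKY generator in the order T, C, A, G and recovers the simpler models as special cases. The parameter values and function names are illustrative assumptions. Note that with equal base frequencies the off-diagonal entries equal 1/4 rather than 1, which differs from Eqs. 11 and 13 only by a constant factor that disappears once the matrix is rescaled.

```python
ORDER = "TCAG"
TRANSITIONS = {("T", "C"), ("C", "T"), ("A", "G"), ("G", "A")}

def hky85(kappa, pi):
    """Unscaled HKY generator; rows and columns ordered T, C, A, G."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(ORDER):
        for j, b in enumerate(ORDER):
            if i != j:
                q[i][j] = pi[b] * (kappa if (a, b) in TRANSITIONS else 1.0)
        q[i][i] = -sum(q[i])  # diagonal: negative sum of the off-diagonal entries
    return q

def expected_rate(q, pi):
    """Expected number of substitutions per unit time at stationarity."""
    return -sum(pi[a] * q[i][i] for i, a in enumerate(ORDER))

def rescale(q, pi):
    """Rescale Q so that branch lengths are expected substitutions per site."""
    rate = expected_rate(q, pi)
    return [[x / rate for x in row] for row in q]

uniform = dict.fromkeys(ORDER, 0.25)
skewed = {"T": 0.1, "C": 0.2, "A": 0.3, "G": 0.4}  # hypothetical frequencies

models = {
    "JC69": (hky85(1.0, uniform), uniform),  # kappa = 1, equal frequencies
    "K80": (hky85(2.0, uniform), uniform),   # kappa free, equal frequencies
    "F81": (hky85(1.0, skewed), skewed),     # kappa = 1, free frequencies
    "HKY85": (hky85(2.0, skewed), skewed),   # both free
}
scaled = {name: rescale(q, pi) for name, (q, pi) in models.items()}
for name, q in scaled.items():
    print(name, [round(x, 3) for x in q[0]])
```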
The level of “sophistication” goes “up to” the general time-reversible
model [48], denoted GTR or REV, which has for
generator:

$$Q_{GTR} = \begin{pmatrix}
- & a\pi_C & b\pi_A & c\pi_G \\
a\pi_T & - & d\pi_A & e\pi_G \\
b\pi_T & d\pi_C & - & f\pi_G \\
c\pi_T & e\pi_C & f\pi_A & -
\end{pmatrix} \tag{15}$$

The number of free parameters is now eight (three base frequencies
plus five nucleotide propensities, as one of the six relative rates a to f
is fixed to set the scale). The name is derived from the
time-reversibility constraint, which implies that the likelihood is
independent of the actual orientation of time.

In fact, there exist only a few “named” additional substitution
models [15], most of which are time-reversible models, while a
total of 203 models can be derived from GTR [49]. We have
focused solely on DNA models in this chapter, but the problem is
similar with amino acid or codon models, except that the number
of parameters increases quickly. We have also limited ourselves to
time-reversible time-homogeneous models, but irreversible
non-homogeneous models were developed some time ago [50]
and are used, for instance, to root phylogenies [51] or to help
alleviate the effects of LBA [39].


2.8 Some Computational Aspects


2.8.1 Optimization of the Likelihood Function


For a given substitution model, how should parameters be esti- 
mated, given the (potentially) high dimensionality of the model? 
Analytical solutions consist in determining when the first derivative 
of the likelihood function is equal to zero (with a change of sign in 
the second derivative). However, finding the root of the likelihood 
function analytically is only possible in the simple case of three 
sequences of binary characters under the assumption of the molec- 
ular clock (see Subheading 3.1) [12]. As a result, numerical solu- 
tions must be found to maximize the likelihood function. 

A number of ideas have been combined to search efficiently for 
the parameter values that maximize the likelihood function. Most 
programs will start from a random starting point, for example,
$(\theta_1^{(0)}, \theta_2^{(0)})$, denoted by an x in Fig. 7, where we limit ourselves to
a two-parameter example. The optimization procedure can follow


Fig. 7 Two optimization strategies. The likelihood surface of a function with two parameters $\theta_1$ and $\theta_2$ (e.g.,
two branch lengths) is depicted as a contour plot, whose highest peak is at the + sign. (a) Optimization one
parameter at a time. (b) Optimization of all parameters simultaneously. See text for details


2.8.2 Convergence


2.9 Selection of the Appropriate Substitution Model


one of the two strategies. In the first one, parameters are optimized 
one at a time. In Fig. 7a, parameter 0, is first optimized to maximize 
the likelihood function with a line search, which defines a direction 
along which the other parameter (02) or parameters in the multidi- 
mensional case are kept constant. Once Di is found, a new 
direction is defined to optimize 62, and so on so forth until conver- 
gence to the maximum of the likelihood function. As shown in 
Fig. 7a, many iterations can be required, in particular when the 
parameters 0) and 0, are correlated. The alternative to optimizing 
one parameter at a time is to optimize all parameters simulta- 
neously. In this case (Fig. 7b), an initial direction is defined at 


(a, a) such that the slope at this point is maximized. The 
process is repeated until convergence. More technical details can 
be found in [52]. The simultaneous optimization procedure gener- 
ally requires fewer steps than optimizing parameters one at a time, 
but not always. Since the computation of the likelihood function is 
the most expensive computation of these algorithms, the simulta- 
neous optimization is much more efficient, at least in our toy 
example. 

How general is this result? Simultaneously optimizing para- 
meters of the substitution model, while optimizing branch lengths 
one at a time, was shown to be more effective on large data sets 
[43], potentially because of the correlation that exists between 
some of the parameters entering the Q matrix (see Subheading 2.7). 


Convergence is usually reached either when the increment in the 
log-likelihood score becomes smaller than an e value, usually set to 
a small number such as 107° (but yet a number larger than the 
machine e: the smallest number that a given computer can repre- 
sent), or when the log-likelihood score has not changed after a 
predetermined number of iterations. However, none of these stop- 
ping rules guarantees that the global maximum of the likelihood 
function has been found. Therefore, it is generally recommended to 
run the optimization procedure at least twice, starting from differ- 
ent initial values of the model parameters, and to check that the 
likelihood score after optimization is the same across the different 
runs (Fig. 8). If this is not the case, additional runs may be required, 
and the one with the largest likelihood is chosen for inference (e.g., 
[53]). 

In many instances though, different substitution models will 
give different tree topologies, and therefore different biological 
conclusions. One difficulty is therefore to know which model 
should be used to analyze a particular data set. 


Fig. 8 Likelihood surfaces behaving badly. The probability surface
of the function $p(X|\theta)$ is plotted schematically as a function of $\theta$. Most line search strategies will
converge (CV) to the MLE when the initial value is in the “CV” interval, and fail
when it is in the “no CV” interval. Adapted with permission from [54]

One important issue in model selection is about the trade-off
between bias and variance [55]: a simple model will fail to capture
all the sophistication of the actual substitution process, and will
therefore be highly biased even if all the parameters can be
estimated with tight precision (little variance). Alternatively, a
highly parameterized model will “spread” the information available
from the data over a large number of parameters, thereby making their
estimation difficult (flat likelihood surface; see Subheading 2.8.1),
with a large variance, in spite of perhaps being a more realistic
model with less bias. The objective of most model selection
procedures is therefore to find not the best model in terms of
likelihood score, but the most appropriate model, the one that strikes
the right balance between bias and variance in terms of number of
parameters. However, we argue that optimizing for this bias-variance
trade-off works only for statistical procedures, be they, for instance,
frequentist (LRT, likelihood ratio test) or Bayesian (BF, Bayes
factor). On the other hand, information-theoretic criteria (e.g.,
AIC, Akaike information criterion) aim at selecting the model
that is approximately closest to the “true” biological process.

The bias-variance trade-off mainly concerns the comparison of
models that are based on the same underlying rationale, for
instance, choosing among the 203 models that can be derived
from GTR. We may also be interested in comparing models that
are based on very different rationales. The likelihood ratio test is
suited for assessing the bias-variance trade-off, while Bayesian and
information-theoretic approaches, as well as cross-validation (CV),
can be used for more general model comparisons. Here we review
four approaches to model selection: LRT, BF, AIC, and CV.


The substitution models presented above have one key property: it
is possible to reduce the most sophisticated time-reversible named
model (GTR+Γ+I) to any simpler model by imposing some constraints
on parameters. As a result, the models are said to be nested,
and statistical theory (the Neyman-Pearson lemma) tells us that
there is an optimal (most powerful) way of comparing two nested
models (a simple null vs. a simple alternative hypothesis) based on
the likelihood ratio test or LRT.

The test statistic of the LRT is twice the log-likelihood difference
between the most sophisticated model (which by definition is
always the one with the highest likelihood; if this is not the case,
there is a convergence issue; see Subheading 2.8.1) and the simpler
model. This test statistic asymptotically follows a $\chi^2$
distribution (under certain regularity conditions), and the number of
degrees of freedom of the test is equal to the difference in the
number of free parameters between the two models.

The null hypothesis is that the two competing models explain 
the data equally well. The alternative is that the most sophisticated 
model explains the data better than the simpler model. If the null 
hypothesis cannot be rejected at a certain level (type-I error rate), 
then, based on the argument developed above, the simpler model 
should be used to analyze the data. Otherwise, if the null hypothe- 
sis can be rejected, the more sophisticated model should be used to 
analyze the data. Note that a test never leads to accepting a null 
hypothesis; the only outcomes of a test are either reject or fail to 
reject a null hypothesis. 
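
As an illustration of the procedure just described, the following sketch
computes the LRT statistic and its asymptotic p-value using only the
Python standard library. The function names (`chi2_sf`, `lrt`) and the
log-likelihood values are hypothetical; the parameter counts assume that
HKY85 has four free parameters and GTR eight, branch lengths being the
same in both models.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) of a chi-square distribution
    with an integer number of degrees of freedom df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1   # exact for df = 1
    else:
        q, k = math.exp(-half), 2              # exact for df = 2
    while k < df:
        # Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) exp(-x/2) / Gamma(k/2 + 1)
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

def lrt(lnl_simple, lnl_complex, k_simple, k_complex):
    """Return the test statistic, degrees of freedom, and p-value."""
    stat = 2.0 * (lnl_complex - lnl_simple)
    df = k_complex - k_simple
    return stat, df, chi2_sf(stat, df)

# Hypothetical maximized log-likelihoods: HKY85 (simpler) vs. GTR
stat, df, p_value = lrt(-2345.67, -2339.12, 4, 8)
reject_simpler_model = p_value < 0.05
```

With these made-up numbers the null hypothesis would be rejected at the
5% level, so that the more sophisticated model would be retained.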

Intuitively, we can see the null hypothesis $H_0$ as stating that a
certain parameter $\theta$ is equal to $\theta_0$. The maximum
likelihood estimate (MLE) is at $\hat\theta$, which is our alternative
hypothesis $H_1$, left unspecified. We note the log-likelihood as
$\ln p(X|\theta) = \ell(\theta)$, where $X$ represents the data. Under
$H_0$, we have $\theta = \theta_0$, while under $H_1$ we have
$\theta = \hat\theta$. The log-likelihood ratio is therefore
$\ln LR = \ell(\hat\theta) - \ell(\theta_0)$. Under the null $H_0$,
$\ell(\hat\theta) = 0$ (by definition, the log-likelihood being taken
relative to its maximum). The log-likelihood ratio then reduces to
$\ln LR = -\ell(\theta_0)$. We can then take the Taylor expansion of
the log-likelihood function $\ell$ around $\hat\theta$, which gives us
$\ell(\theta_0) \approx \frac{1}{2} (\theta_0 - \hat\theta)^2 \frac{d^2\ell}{d\theta^2}$
(recall that $\ell(\hat\theta) = 0$, and that $\ell'(\hat\theta) = 0$
at the MLE, so that the first two terms of the series “disappear”).
Therefore, the log-likelihood ratio can be approximated by
$-\frac{1}{2} (\theta_0 - \hat\theta)^2 \frac{d^2\ell}{d\theta^2}$.
Recall that the variance of $\hat\theta$ is asymptotically the negative
reciprocal of the second derivative of the log-likelihood function
(that is, the reciprocal of Fisher's information), so that:

$$\ln LR \approx \frac{(\theta_0 - \hat\theta)^2}{2\,\mathrm{var}(\hat\theta)} \qquad (16)$$

which follows asymptotically half a $\chi^2$ distribution. Hence the
usual approximation:

$$2 \ln LR = 2 \times (\ell_1 - \ell_0) \sim \chi^2_k \qquad (17)$$

with $k$ being the difference in the number of free parameters
between the two models 0 and 1. The important points in this
intuitive outline of the proof are that (1) the two hypotheses need
to be nested and (2) taking the Taylor expansion around $\hat\theta$
requires that the likelihood function be smooth at that point, that
is, that $\ell$ be differentiable left and right of $\hat\theta$.
Therefore, testing points at the boundary of the parameter space
cannot be done by approximating the distribution of the test statistic
of the LRT by a regular $\chi^2$ distribution, as noted many times in
molecular evolution [56-64]. A solution still involves the LRT, but
the asymptotic distribution becomes a mixture of $\chi^2$
distributions [65].
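
A minimal sketch of the boundary case follows, under a specific
assumption that is not spelled out in the text: a single parameter is
tested at the boundary of its range (for instance a branch length or a
variance fixed at zero), in which case the standard result is a 50:50
mixture of a point mass at zero and a $\chi^2_1$ distribution. Other
configurations require other mixture weights (see [65]). The function
names are hypothetical.

```python
import math

def lrt_pvalue_regular(stat):
    """p-value from a chi-square distribution with 1 degree of freedom."""
    if stat <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(stat / 2.0))

def lrt_pvalue_boundary(stat):
    """p-value from a 50:50 mixture of a point mass at 0 and chi-square(1),
    assumed here for one parameter tested at the boundary of its range."""
    if stat <= 0.0:
        return 1.0
    return 0.5 * lrt_pvalue_regular(stat)

stat = 3.0   # hypothetical value of 2 ln LR
p_regular = lrt_pvalue_regular(stat)
p_boundary = lrt_pvalue_boundary(stat)
```

Under this assumption the mixture simply halves the regular p-value, so
that ignoring the boundary makes the test conservative.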

An approach that has become popular under the widespread
adoption of computer programs such as ModelTest [66] and
jModelTest [67] is the hierarchical LRT (hLRT). This hierarchy
goes from the simplest model (JC) to the set of most complex
models (+Γ+I), traversing a tree of models. The issue is that there
is more than one way to traverse this tree of models, and that
depending on which way is adopted, the procedure may end up
selecting different models [68, 69].


Information theory provides us with a number of solutions to
circumvent the three limitations of the LRT (nestedness, continuity,
and dependency on the order in which models are compared).

The core of the information-based approach is the
Kullback-Leibler (KL) distance, or information [70], which measures
the distance between an approximating model $g$ and a “true”
model $f$ [55]. This distance is computed as:

$$d_{KL}(f, g) = \int f(x) \ln \frac{f(x)}{g(x|\theta)} \, dx \qquad (18)$$

where $\theta$ is a vector of parameters entering the approximating
model $g$ and $x$ represents the data. Note that this distance is not
symmetric, as typically $d_{KL}(f, g) \neq d_{KL}(g, f)$, and that the
“true” model $f$ is unknown. The idea is to rewrite $d_{KL}(f, g)$ in a
slightly different form, to make it clear that Eq. 18 is actually a
difference between two expectations, both taken with respect to the
unknown “truth” $f$:

$$d_{KL}(f, g) = \mathrm{E}_f[\ln f(x)] - \mathrm{E}_f[\ln g(x|\theta)] \qquad (19)$$

Equation 19 therefore measures the loss of information incurred by
fitting $g$ when the data $x$ actually come from $f$. As $f$ is
unknown, $d_{KL}(f, g)$ cannot be computed as such.

Two points are key to deriving the criterion proposed by Akaike
(see [55]). First, we usually want to compare at least two
approximating models, $g_0$ and $g_1$. We can then measure which one is
closest to the “true” process $f$ by taking the difference between
their respective Kullback-Leibler distances. In the process, the
direct reference to the “true” process cancels out. As a result, the
“best” model among $g_0$ and $g_1$ is the one that is closest to the
“true” process $f$: it is the model that minimizes the distance to
$f$. By setting model parameters to their MLEs, we now deal with
estimated distances, but these are still with respect to the unknown
$f$.

Second, in the context of a frequentist approach, we would
repeat the experiment of sampling data an infinite number of times.
We would then compute the expected estimated KL distance, so that
model selection can be done on the sole estimated log-likelihood
value. Akaike, however, showed that this latter approximation is
biased, and must be adjusted by a term that is approximately equal
to the number of parameters $K$ entering model $g$ (see [55]). For
“historical reasons” (similarity with asymptotic theory with the
normal distribution), the selection criterion is multiplied by 2 to
give the well-known definition of the Akaike information criterion
or AIC:

$$AIC = -2 \ln p(X|\hat\theta) + 2K = -2\,\ell(\hat\theta) + 2K \qquad (20)$$

Unlike the case of the hLRT, where we were selecting the “most
appropriate model” (with respect to the bias-variance trade-off), in
the case of AIC we can select the best model. This best model is the
one that is closest to the “true” unknown model ($f$), with the
smallest relative estimated expected KL distance. The best AIC
model therefore minimizes the criterion in Eq. 20.
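
The cancellation of the “true” term when two approximating models are
compared (the first key point above) can be checked numerically. The
sketch below uses a discrete analogue of Eqs. 18 and 19 on hypothetical
base frequencies; the vectors `f`, `g0`, and `g1` and the function names
are made up for illustration, and in real applications $f$ is of course
unknown.

```python
import math

def kl(f, g):
    """Discrete analogue of Eq. 18: sum over x of f(x) ln[f(x) / g(x)]."""
    return sum(fx * math.log(fx / gx) for fx, gx in zip(f, g) if fx > 0.0)

def expected_loglik(f, g):
    """E_f[ln g(x)], the second term of Eq. 19."""
    return sum(fx * math.log(gx) for fx, gx in zip(f, g) if fx > 0.0)

# Hypothetical "true" base frequencies and two approximating models
f = [0.40, 0.30, 0.20, 0.10]
g0 = [0.25, 0.25, 0.25, 0.25]
g1 = [0.35, 0.35, 0.15, 0.15]

# The difference of KL distances involves only the expected log-likelihoods:
# the term E_f[ln f(x)] is common to both distances and cancels out.
diff_kl = kl(f, g0) - kl(f, g1)
diff_ell = expected_loglik(f, g1) - expected_loglik(f, g0)

closest = "g1" if kl(f, g1) < kl(f, g0) else "g0"
asymmetry = kl(f, g0) - kl(g0, f)    # nonzero: the distance is not symmetric
```

Here `diff_kl` and `diff_ell` should agree up to rounding error, which
is why models can be ranked without ever evaluating $f$ itself.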

A small-sample second-order version of AIC exists, where the 
penalty for extra parameters (2K in Eq. 20) is slightly modified to 
account for the trade-off between information content in the data 
and K (see [55]). In our experience, we find it advisable to use this 
small-sample correction irrespective of the actual size of the data, 
since this correction vanishes in large and informative samples, but 
corrects for proper model ranking when K becomes very large 
compared to the amount of information (e.g., in phylogenomics 
where models are partitioned with respect to hundreds of genes). 

The AIC has been shown to tend to favor parameter-rich
models [71-75], which has motivated the use and development
of alternative approaches in computational molecular evolution.
These include the Bayesian information criterion (BIC) [76] and the
decision theory or DT approach, which is based on ΔAIC weighted
by squared branch length differences [71]. Most of these
approaches, including the hLRT, have recently been compared in
a simulation study that suggests, in agreement with empirical
studies [72, 77], that both BIC and DT have the highest accuracy and
precision [75].
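
The criteria discussed in this section are simple to compute once each
model has been fitted. The sketch below assumes the usual textbook forms
of the small-sample correction, AICc = AIC + 2K(K + 1)/(n − K − 1), and
of BIC = −2 ln L + K ln n, neither of which is written out above. The
log-likelihoods are hypothetical, the sample size n is taken to be the
alignment length (itself a debatable choice), and branch lengths are
left out of K because they are shared by all three models.

```python
import math

def aic(lnl, k):
    """Eq. 20."""
    return -2.0 * lnl + 2.0 * k

def aicc(lnl, k, n):
    """Second-order (small-sample) AIC; requires n > k + 1."""
    return aic(lnl, k) + 2.0 * k * (k + 1) / (n - k - 1)

def bic(lnl, k, n):
    return -2.0 * lnl + k * math.log(n)

# Hypothetical fits: model -> (maximized log-likelihood, free parameters)
fits = {"JC": (-2400.0, 0), "HKY85": (-2345.7, 4), "GTR": (-2339.1, 8)}
n_sites = 500

def best(criterion):
    """Name of the model minimizing the given criterion."""
    return min(fits, key=lambda m: criterion(*fits[m]))

best_aic = best(aic)
best_aicc = best(lambda lnl, k: aicc(lnl, k, n_sites))
best_bic = best(lambda lnl, k: bic(lnl, k, n_sites))
```

With these made-up values, AIC and AICc retain the parameter-rich GTR
model while BIC retains HKY85, which illustrates the heavier penalty
that BIC places on extra parameters.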

One particular drawback of these information-theoretic 
approaches is that they require that every single model of evolution, 
or at least the most “popular” models (the few named ones), be 
evaluated. This step can be time-consuming, especially if a full 
maximum likelihood optimization is performed under each 
model. A first set of heuristics consists in fixing the tree topology 
to a tree estimated with a quick distance-based method such as 
BioNJ [78], and then estimating just the branch lengths and the 
parameters of the substitution model, as implemented in 
jModelTest [67]. As the optimizations are independent of each
other under each substitution model, these computations are
typically forked to multiple cores or processors [79]. Further
heuristics exist to avoid all these independent optimizations [79], as
implemented in SMS (Smart Model Selection in PhyML), which is
reported to cut runtimes in half without forfeiting
accuracy [80].

Note finally that all these approaches are not limited to selecting
the most appropriate or the best model of evolution. Disregarding
the hLRT, which requires that models be nested (to be able to
use the $\chi^2$ approximation; otherwise, see [65]), AIC, BIC, etc.
allow us to compare non-nested models and, in particular, phylogenetic
trees (branch lengths plus topology).


The Bayesian framework has permitted the development of two 
main approaches, which are actually two sides of the same coin: one 
based on finding the model that is the most probable a posteriori, 
and one based on ranking models and estimating a quantity called 
the Bayes factor. 

In a nutshell, the frequentist approaches developed in the
previous sections are based on the likelihood, which is the
probability of the data, given the parameters: $p(X|\theta)$. However,
this approach may not be the most intuitive, since most practitioners
are not interested in knowing the conditional probability of their
data, as the data were collected to learn more about the processes
that generated them. It can therefore be argued that the Bayesian
approach, which considers the probability of the parameters given
the data or $p(\theta|X)$, is more intuitive than the frequentist
approach. Unlike likelihood, which relies on the function
$p(X|\theta)$ and permits point estimation, Bayesian inference is based
on the posterior distribution $p(\theta|X)$. This distribution is often
summarized by a centrality measure such as its mode, mean, or median.
Measures of uncertainty are based on credibility intervals, the
Bayesian equivalent of confidence intervals. Typically, credibility
intervals are taken at the 95% cutoff and are called highest posterior
densities (HPDs).

The connection between posterior probability and likelihood is
made with Bayes’ inversion formula, also called Bayes’ theorem, by
means of a quantity called the prior distribution $p(\theta)$:

$$p(\theta|X) = \frac{p(X|\theta)\, p(\theta)}{p(X)} \qquad (21)$$

The prior represents what we think about the process that generated
the data, before analyzing the data, and is at the origin of all
controversies surrounding Bayesian inference. In practice, priors
are more typically chosen based on statistical convenience, and
often have nothing to do with our genuine state of knowledge
about parameters before observing the available data. We will see
in Subheading 3.1 that priors can be used to distinguish between
parameters that are confounded in a maximum likelihood analysis
(model), so that we argue that the frequentist vs. Bayesian
controversy is sterile, and we advocate a more pragmatic approach
that often results in the mixing of both approaches (in their concepts
and techniques) [81, 82].
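
For a single parameter, Eq. 21 can be evaluated directly on a grid,
which makes the posterior summaries mentioned above easy to visualize.
The following sketch is only a toy case: the data (7 differing sites out
of 20), the flat prior, and the function names are assumptions, and the
HPD routine supposes a unimodal posterior.

```python
import math

def grid_posterior(loglik, logprior, grid):
    """Eq. 21 on a regular grid: posterior proportional to likelihood x prior."""
    logp = [loglik(t) + logprior(t) for t in grid]
    shift = max(logp)                      # guards against underflow
    weights = [math.exp(v - shift) for v in logp]
    total = sum(weights)                   # plays the role of p(X), up to a constant
    return [w / total for w in weights]

def hpd(grid, post, mass=0.95):
    """Bounds of the smallest set of grid points holding `mass`.
    Assumes a unimodal posterior, so that this set is an interval."""
    order = sorted(range(len(grid)), key=lambda i: post[i], reverse=True)
    kept, acc = [], 0.0
    for i in order:
        kept.append(grid[i])
        acc += post[i]
        if acc >= mass:
            break
    return min(kept), max(kept)

# Hypothetical data: k of n sites differ; theta is the probability of a difference
k, n = 7, 20
grid = [(i + 0.5) / 1000 for i in range(1000)]
post = grid_posterior(
    lambda t: k * math.log(t) + (n - k) * math.log(1.0 - t),
    lambda t: 0.0,                         # flat prior on (0, 1)
    grid,
)
mode = grid[max(range(len(grid)), key=lambda i: post[i])]
mean = sum(t * p for t, p in zip(grid, post))
lower, upper = hpd(grid, post)
```

Grid evaluation quickly becomes impractical as the number of parameters
grows, which is one motivation for sampling-based approaches.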

All models have parameters. Subheading 2.7 treats substitution
models, which can have eight free parameters in the case of GTR
+ Γ. Most people are not really interested in these parameters
$\theta$ or in their estimates $\hat\theta$, but have to use them in
order to estimate a phylogenetic tree $\tau$. These parameters
$\theta$ are called nuisance parameters because they enter the model
but are not the focus of inference. The likelihood solution consists
in setting these parameters to their MLE, ignoring the uncertainty
with which they can be estimated, while the Bayesian approach will
integrate them out, directly accounting for their uncertainty:

$$p(X|\tau) = \int p(X|\tau, \theta)\, p(\theta)\, d\theta \qquad (22)$$

One difficulty in Bayesian inference is about the denominator
in Eq. 21, as this denominator often has no analytical solution. In
spite of being a normalizing constant, $p(X)$ requires integrating out
nuisance parameters by means of prior distributions as in Eq. 22.
Thus, it is easy to see from Eq. 21 that the posterior distribution of
the variable of interest (e.g., $\tau$) can quickly become complicated:

$$p(\tau|X) = \frac{\int p(X|\tau, \theta)\, p(\tau)\, p(\theta)\, d\theta}{\sum_{\tau \in T} \int p(X|\tau, \theta)\, p(\tau)\, p(\theta)\, d\theta} \qquad (23)$$

where $\tau$ and $\theta$ are assumed to be independent and the
discrete sum is taken over the set $T$ of all possible topologies (see
Subheading 2.10.1). However, the ratio of posteriors evaluated at two
different points will simplify: as the denominator in Eq. 23 is a
constant, it will cancel out from the ratio. This simple observation
is at the origin of an integration technique for approximating the
posterior distribution in Eq. 23: Markov chain Monte Carlo (MCMC)
samplers. A very clear introduction can be found in [83].
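
The role played by the ratio of posteriors can be seen in a bare-bones
Metropolis sampler. This is a sketch under simplifying assumptions: a
single parameter, a binomial likelihood with hypothetical data, a flat
prior, and a symmetric uniform proposal; the function names, the
proposal width, and the chain length are arbitrary choices. Samplers
used in phylogenetics also propose changes to the tree, which is not
attempted here.

```python
import math
import random

def log_unnormalized_posterior(theta, k=7, n=20):
    """ln[p(X|theta) p(theta)] for a binomial likelihood and a flat prior.
    The normalizing constant p(X) is never computed."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)

def metropolis(logpost, start, n_iter=20000, width=0.2, seed=1):
    rng = random.Random(seed)
    theta, lp = start, logpost(start)
    samples = []
    for _ in range(n_iter):
        proposal = theta + rng.uniform(-width, width)   # symmetric move
        lp_new = logpost(proposal)
        # Only the ratio of posteriors is needed, so p(X) cancels out
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            theta, lp = proposal, lp_new
        samples.append(theta)
    return samples

chain = metropolis(log_unnormalized_posterior, start=0.5)
kept = chain[1000:]                      # discard an arbitrary burn-in
posterior_mean = sum(kept) / len(kept)
```

For these data and this prior the exact posterior is a Beta(8, 14)
distribution with mean 8/22, which gives a simple check on the output
of the chain.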

Building on this, two approaches can be formulated to compare
models in a Bayesian framework. The first is to treat the model as a
“random variable,” and compute its posterior probability. The best
model is then the one that has the highest posterior probability.
This approach is typically implemented in a reversible-jump
MCMC (or rjMCMC) sampler (e.g., see [49]).

The alternative is to use the Bayesian equivalent of the LRT, the
Bayes factor. Rather than comparing two likelihoods, the Bayes
factor compares the probability of the data under two models, $M_0$
and $M_1$:

$$BF_{0,1} = \frac{p(X|M_0)}{p(X|M_1)} \qquad (24)$$

More specifically, $BF_{0,1}$ evaluates the weight of evidence in favor
of model $M_0$ against model $M_1$, with $BF_{0,1} > 1$ considered as
evidence in favor of $M_0$. Just as in a frequentist context, where a
null hypothesis is significantly rejected at a certain threshold, 5%,
1%, or less depending on different costs or error types, Bayes factors
can be evaluated on a specific scale [84]. However, because this scale
is just as ad hoc as in a frequentist setting, it might be preferable
to use the probability of the data under a particular model
$p(X|M_i)$ as a means of ranking models $M_i$.

The quantity $p(X|M_0)$, which is the denominator in Eq. 23
(where we did not include the dependence on the model in the
notation), is called the marginal likelihood. Note that it is also an
expectation with respect to a prior probability distribution:

$$p(X|M_0) = \int_\Theta p(X|\theta, M_0)\, p(\theta|M_0)\, d\theta \qquad (25)$$

A number of approximations to evaluate Eq. 25 exist and are
reviewed in [85] (see also [86, 87]). The simplest one is based on
the harmonic mean of the likelihood sampled from the posterior
distribution [88], also known as the harmonic mean estimator
(HME). Seeing how this estimator is derived requires understanding
how integrals can be approximated. Briefly, to compute
$I = \int g(\theta)\, p(\theta)\, d\theta$, generate a sample
$\theta_1, \ldots, \theta_N$ from a distribution $p^*(\theta)$
and calculate the simulation-consistent estimator
$\hat{I} = \sum_i w_i\, g(\theta_i) / \sum_i w_i$, where $w_i$ is the
importance function $p(\theta_i)/p^*(\theta_i)$. Take
$g = p(X|\theta)$ and $p^*(\theta) = p(X|\theta)\, p(\theta)/p(X)$, then
$\hat{I} = \hat{p}(X|M_0) = \lim_{N \to \infty} \left( \frac{1}{N} \sum_{i=1}^{N} \frac{1}{p(X|\theta_i)} \right)^{-1}$
with $\theta_i \sim p(\theta|X)$ (see supplementary information in
[89]). As a result, a very simple way to estimate the marginal
likelihood and Bayes factors is to take the output of an MCMC sampler
and compute the harmonic mean of the likelihood values (not the
log-likelihood values) sampled from the posterior distribution.
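
In practice, samplers report log-likelihoods, and exponentiating them
directly would underflow. A minimal sketch of the harmonic mean
computation carried out on the log scale is given below; the function
names and the sampled values are hypothetical, and the trick used
(shifting by the largest term before exponentiating) is a standard
numerical device rather than something prescribed by the text.

```python
import math

def log_hme(loglik_samples):
    """ln of the harmonic mean of the likelihoods, from their logarithms:
    ln HME = ln N - ln sum_i exp(-lnL_i), computed with a log-sum-exp shift."""
    neg = [-v for v in loglik_samples]
    shift = max(neg)
    log_sum = shift + math.log(sum(math.exp(v - shift) for v in neg))
    return math.log(len(neg)) - log_sum

def log_bayes_factor(loglik_m0, loglik_m1):
    """ln BF(0,1) estimated from posterior samples under each model (Eq. 24)."""
    return log_hme(loglik_m0) - log_hme(loglik_m1)

# Hypothetical log-likelihoods sampled from the posterior under two models
sample_m0 = [-5003.2, -5001.7, -5002.4, -5004.9, -5002.0]
sample_m1 = [-5006.1, -5005.3, -5007.8, -5005.9, -5006.4]
ln_bf = log_bayes_factor(sample_m0, sample_m1)
```

Note that the sum is dominated by the lowest sampled likelihoods, so
that a single poor draw can shift the estimate noticeably.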

Because of its simplicity, this estimator is now implemented in
most popular programs such as MrBayes [90] or BEAST [91].
However, it might be considered as the worst estimator possible,
because its results are unstable [88, 92] and biased towards the
selection of parameter-rich models [86]. An alternative and reliable
estimator, based on thermodynamic integration (TI; [86], also
known as path sampling; [93, 94]), is much more demanding in
terms of computation. Indeed, it requires running MCMC samplers
morphing one model into the other (and vice versa), which
can increase computation time by up to an order of magnitude
[86]. Improvements of the TI estimator are however available. The
stepping-stone (SS) approach builds on importance sampling and
TI to speed up the computation while maintaining the accuracy of
the standard TI estimator [87, 95].

Moving away from the estimation of marginal likelihoods, an
analogue of AIC that can be obtained through the output of an
MCMC sampler (AICM) was proposed [96]. In essence, it relies on
the asymptotic convergence of the posterior distribution of the
log-likelihood on a gamma distribution [97]. As such, it becomes
possible to estimate the effective number of parameters as twice the
sample variance of the posterior distribution of the log-likelihood,
which itself can be estimated by a resampling procedure
[96]. This gives a very elegant means of estimating AIC from the
posterior simulations. However, although AICM seems to be a
more stable measure of model ranking than HME, both TI and
SS still seem to outperform this estimator, at least in the case of the
comparison of demographic and relaxed molecular clock models
[96] (see Subheading 3).


2.9.4 Cross-Validation

Cross-validation is another model selection approach, which is 
extremely versatile in that it can be used to compare any set of 
models of interest. Besides, the approach is very intuitive. In its 
simplest form, cross-validation consists in dividing the available 
data into two sets, one used for “training” and the other one used 
for “validating.” In the training step (TS), the model of interest is 
fitted to the training data in order to obtain a set of MLEs. These 
MLEs are then used to compute the likelihood using the validation 
data (validation step, VS). Because the validation data were not part 
of the training data, the likelihood values computed during VS can 
be directly used to compare models, without requiring any explicit 
correction for model dimensionality. 

The robustness of the cross-validation scores can be explored in
various ways, such as repeating the above procedure with a switched
labeling of training and validation data (hence the expression
cross-validation). Of course, this simple 2-fold cross-validation
could be extended to n-fold cross-validation, where the data are
subdivided into $n$ subsets, with $n - 1$ subsets serving for
training, and one for validation. Ideally, the procedure is repeated
$n - 1$ additional times.
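
The training and validation steps can be sketched on a deliberately
simple problem. Instead of a full phylogenetic likelihood, the example
below compares two models of base composition for a single hypothetical
sequence whose sites are treated as independent: one with equal base
frequencies and one with frequencies estimated from the training data.
All names and data are made up, folds are contiguous blocks, and the
code assumes that the sequence length is a multiple of the number of
folds and that every base occurs in each training set.

```python
import math
from collections import Counter

BASES = "ACGT"

def fit_equal(train):
    """Model 0: equal base frequencies, nothing to estimate."""
    return {b: 0.25 for b in BASES}

def fit_free(train):
    """Model 1: base frequencies set to their MLEs (observed proportions)."""
    counts = Counter(train)
    return {b: counts[b] / len(train) for b in BASES}

def loglik(freqs, sites):
    """Log-likelihood of independent sites under given base frequencies."""
    return sum(math.log(freqs[b]) for b in sites)

def cross_validate(fit, sites, n_folds):
    """Sum of the validation log-likelihoods over all n folds."""
    size = len(sites) // n_folds
    score = 0.0
    for i in range(n_folds):
        valid = sites[i * size:(i + 1) * size]
        train = sites[:i * size] + sites[(i + 1) * size:]
        params = fit(train)               # training step (TS)
        score += loglik(params, valid)    # validation step (VS)
    return score

sites = "AAAAAACCGT" * 10                 # hypothetical data, 100 sites
cv_equal = cross_validate(fit_equal, sites, 5)
cv_free = cross_validate(fit_free, sites, 5)
preferred = "free" if cv_free > cv_equal else "equal"
```

Because the scores are computed on data that were not used for fitting,
they can be compared directly even though the two models differ in their
number of parameters.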

We know of only two examples of its use in phylogenetics, one 
in the ML framework [98] and one with a Bayesian approach [99]. 
Given the increasing size of modern data sets, putting aside some of 
the data for validation is probably not going to dramatically affect 
the information content of the whole data set. As a result, model 
selection via cross-validation, which is statistically sound, could 
become a very popular approach. 



2.10 Finding the Best Tree Topology

Now that we can select a model of evolution (Subheading 2.9) and
estimate model parameters (Subheading 2.8) under a particular
model (Subheading 2.5), how do we find the optimal tree? The
basic example in Subheading 2.1 suggested that we score all possible
tree topologies and choose for inference the one that has the
highest score. However, a simple counting exercise shows that an
exhaustive examination of all possible topologies is not realistic.


2.10.1 Counting Trees

Figure 9 shows how to count tree topologies. Starting from the
simplest possible unrooted tree, with three taxa, there are three
positions where a fourth branch (leading to a fourth taxon) can be
added. As a result, there are three possible topologies with four
taxa. For each of these, there are five places on the tree where a
fifth branch can be added, which leads to a total of
$3 \times 5 = 15$ topologies with five taxa. A recursion appears
immediately, and it can be shown that the total number of unrooted
topologies with $n$ taxa is equal to
$1 \times 3 \times \cdots \times (2n - 5)$ [100] (see [15] for the
deeper history), which, as given in [101], is equal to:

$$T(n)_{unrooted} = \frac{(2n - 5)!}{2^{n-3}\,(n - 3)!} = \frac{2^{n-2}\,\Gamma\left(n - \frac{3}{2}\right)}{\sqrt{\pi}} \qquad (26)$$

where the $\Gamma$ function for any positive real number $x$ is
defined as $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt$. An
approximation based on Stirling's formula is also given in [101].
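
The counting argument and Eq. 26 can be checked with a few lines of
code. The function names below are hypothetical; the rooted count
relies on the fact that rooting a tree amounts to grafting one extra
branch, so that it equals the unrooted count for one more taxon.

```python
import math

def n_unrooted(n):
    """1 x 3 x ... x (2n - 5): unrooted binary topologies for n >= 3 taxa."""
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count

def n_unrooted_factorial(n):
    """Factorial form of Eq. 26 (exact integer arithmetic)."""
    return math.factorial(2 * n - 5) // (2 ** (n - 3) * math.factorial(n - 3))

def n_unrooted_gamma(n):
    """Gamma-function form of Eq. 26 (floating point)."""
    return 2.0 ** (n - 2) * math.gamma(n - 1.5) / math.sqrt(math.pi)

def n_rooted(n):
    """Rooted binary topologies: same count as for n + 1 unrooted taxa."""
    return n_unrooted(n + 1)

counts = {n: (n_unrooted(n), n_rooted(n)) for n in (3, 4, 5, 6, 10, 20)}
```

Python integers have arbitrary precision, so the product form remains
exact even for large numbers of taxa, whereas the Gamma form is limited
by floating-point accuracy.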

The same exercise can be done for rooted trees (Fig. 10), where
the number of possible rooted topologies with $n$ taxa becomes
$1 \times 3 \times \cdots \times (2n - 3)$, which is

$$T(n)_{rooted} = \frac{(2n - 3)!}{2^{n-2}\,(n - 2)!} = \frac{2^{n-1}\,\Gamma\left(n - \frac{1}{2}\right)}{\sqrt{\pi}} \qquad (27)$$

Fig. 9 Procedure to count the number of unrooted topologies. The top line shows the current number of taxa
included in the tree below. Gray arrows indicate locations where an additional branch can be grafted to add
one taxon. Black arrows show the resulting number of topologies after addition of a branch (taxon). Only one
such possible topology is represented at the next step. The bottom line indicates the number of possibilities.
These numbers multiply to obtain the total number of trees

Fig. 10 Procedure to count the number of rooted topologies. See Fig. 9 for legend and text for details

Table 2
Counting tree topologies

Number of taxa   Unrooted trees                 Rooted trees
3                1                              3
4                3                              15
5                15                             105
6                105                            945
10               2,027,025                      34,459,425
20               221,643,095,476,699,771,875    8,200,794,532,637,891,559,375

Number of tree topologies are given for the unrooted and rooted cases

Note that $T(n)_{rooted} = T(n + 1)_{unrooted}$, as Table 2 clearly suggests.

As a result, the number of possible topologies quickly becomes
very large as the number $n$ of sequences increases, even for
modest values of $n$, so that heuristics become necessary to find the
best-scoring tree.


2.10.2 Some Heuristics to Find the Best Tree

The simplest approach builds upon the idea presented in Figs. 9
and 10. Stepwise addition, for instance, starts with three sequences
drawn at random among the $n$ sequences to be analyzed, and adds
sequences one at a time, keeping only the tree that has the highest
score at each step (e.g., [52]). However, there is no guarantee that
the final tree is the optimal tree [44]. The idea behind
branch-and-bound [102], refined in [103], is to have a look-ahead
routine that prevents entrapment in suboptimal trees. This routine
sets a bound on the trees selected at each round of additions, such
that only the trees that have a score at least as good as that of the
trees obtained in the next round are kept in the search algorithm.
Solutions found by the branch-and-bound algorithm are optimal, but
computing time quickly becomes prohibitive with more than 20
sequences.

As a result, most tree-search algorithms will start with a quickly
obtained tree, often reconstructed with an algorithm based on
pairwise distances such as neighbor-joining [104] or a related
approach [78, 105], and then alter the tree randomly until no
further improvement is obtained or until a certain number of
unsuccessful attempts is reached. Examples of such algorithms
include nearest neighbor interchange (NNI), subtree pruning and
regrafting (SPR), or tree bisection and reconnection (TBR); see,
for instance, [6] for a full description. While the details are of
little importance here, the critical point is the extent of topological
rearrangement in each case. With, e.g., NNI, each rearrangement
can give rise to two topologies. The result is that exploring the
topology space is slow, especially in problems with large $n$. On the
other hand, TBR has, among the three methods cited above, the
largest number of neighbors. As a result, the topology space is
explored quickly, but the optimal tree can be “missed” simply
because a dramatic change is attempted, so that the computational
cost increases. Alternatively, the chance of finding the optimal tree
$\tau$ when $\tau$ is very different from the current tree is higher
when the algorithm can create some dramatic rearrangements. Some
programs, such as PhyML ver. 3.0, now use a combination of NNI
and SPR to address this issue [24]. MCMC samplers that search the
tree space implement somewhat similar tree-perturbation algorithms
that are either “global,” and modify the topology dramatically,
or “local” [106] (see also [107] for a correction of the original
local moves). As a result, MCMC samplers are affected by the same
issues as traditional likelihood methods. Much of the difficulty
therefore comes from this kind of trade-off between larger
rearrangements that are expected to improve accuracy and the
computational burden associated with these extra computations [108].
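
To give a sense of scale for this trade-off, the sketch below compares
the number of topologies reachable in one rearrangement with the size of
the whole topology space. The neighborhood sizes used, 2(n − 3) for NNI
and 2(n − 3)(2n − 7) for SPR on unrooted binary trees, are standard
combinatorial results that are not derived in this chapter; TBR is left
out because its neighborhood size depends on the shape of the tree. The
function names are hypothetical.

```python
def n_unrooted(n):
    """Number of unrooted binary topologies: 1 x 3 x ... x (2n - 5)."""
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count

def nni_neighbors(n):
    """Two alternative topologies around each of the n - 3 internal branches."""
    return 2 * (n - 3)

def spr_neighbors(n):
    """Size of the SPR neighborhood of an unrooted binary tree (assumed formula)."""
    return 2 * (n - 3) * (2 * n - 7)

def explored_fraction(n, neighbors):
    """Share of the topology space visited in one round of rearrangements."""
    return neighbors(n) / n_unrooted(n)

summary = {
    n: (nni_neighbors(n), spr_neighbors(n), n_unrooted(n))
    for n in (4, 5, 10, 20)
}
```

Even the larger SPR neighborhood covers a vanishing fraction of the
space as soon as n exceeds a handful of taxa, which is why many rounds
of rearrangement, and good starting trees, are needed.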


As some of the above computations can become quite costly (high 
runtimes, heavy memory footprints, poor scalability with large data 
sets, etc.), computational workarounds have been and are being 
explored. One of these resorts to approximate Bayes computing 
(ABC), which is essentially a likelihood-free approach. First devel- 
oped in the context of population genetics [109, 110], the driving 
idea is to bypass the optimization procedures and replace them with 
simulations in the context of a rejection sampler. In population 
genetics, the problem could be about a gene tree, which is usually 
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appropriately described by a coalescence tree [111, 112], for which 
we want to estimate some model parameters. As we are able to 
simulate trees from such a process, it is possible to place prior 
distributions on these model parameters, and simulate trees by 
drawing parameters until the simulated trees “look like” the 
observed tree. The set of parameters thus drawn approximates the 
posterior distribution of the corresponding variables. This forms
the basis of a naive rejection sampler, which is quite flexible as it does
not even require that a probabilistic model be formulated, but which
can be inefficient, especially if the posterior distribution is far
from the prior distribution, which is usually the case. As a result, a
number of variations have been described, trying either to correlate 
sample draws as in MCMC samplers [113] or to resample sequen- 
tially from the past [114, 115]. In spite of recent reviews of the 
computational promises and deliveries of ABC samplers 
[116-118], the few applications in molecular evolution have 
been, to date, mostly limited to molecular epidemiology 
[119-122]. One of the major challenges in estimating a phylogenetic
tree from a sequence alignment with ABC is the lack of a proper and
efficient simulation strategy: it is possible to simulate trees under
various processes (we saw the coalescent above), and it is also possible to
simulate an alignment from a given (possibly simulated) tree, so
that in theory one could imagine an ABC algorithm that would use
this backward process to estimate phylogenetic trees by comparing
a simulated alignment to an “actual” alignment. This, however,
would most likely be a very inefficient sampler.
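
The naive rejection sampler described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method of any of the cited papers: the data are summarized by the number of segregating sites, trees come from the standard coalescent, mutations follow an infinite-sites model, the prior on the scaled mutation rate is flat, and all function and parameter names are hypothetical.

```python
import random

def simulate_segregating_sites(n, theta, rng):
    """Number of segregating sites for n sequences under the standard
    coalescent with scaled mutation rate theta (infinite-sites mutations)."""
    tree_length = 0.0
    for k in range(n, 1, -1):
        # Time during which k lineages coexist, in coalescent units.
        tree_length += k * rng.expovariate(k * (k - 1) / 2.0)
    # Mutations form a Poisson process of rate theta / 2 per unit of tree length;
    # count the arrivals of a unit-rate process that fall below the expected number.
    expected = theta / 2.0 * tree_length
    count, arrival = 0, rng.expovariate(1.0)
    while arrival < expected:
        count += 1
        arrival += rng.expovariate(1.0)
    return count

def abc_rejection(s_obs, n, draws=20000, theta_max=20.0, tol=1, seed=1):
    """Naive ABC rejection sampler: keep the prior draws whose simulated
    summary statistic lies within `tol` of the observed one."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(draws):
        theta = rng.uniform(0.0, theta_max)  # draw from a flat prior (assumption)
        if abs(simulate_segregating_sites(n, theta, rng) - s_obs) <= tol:
            accepted.append(theta)
    return accepted

# Hypothetical data set: 10 sequences showing 15 segregating sites.
posterior = abc_rejection(s_obs=15, n=10)
print(len(posterior), sum(posterior) / len(posterior))
```

The accepted values approximate the posterior distribution of the parameter. Only a small fraction of the draws is accepted, which illustrates why this sampler becomes inefficient when the posterior is far from the prior.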

A second area that holds promise is the use of artificial intelligence
(AI), and more specifically of machine learning (ML), in
molecular evolution. Here again, attempts have been made to
use standard ML approaches such as support vector machines
[123] to guide the comparison of tree shapes, for instance [124],
which can then be used in epidemiology [121], but estimating a
phylogenetic tree has proved more challenging. In one notable
exception, an alignment-free distance-based tree-reconstruction 
method was proposed [125], but its main legacy seems to be in 
the development of k-mers, or unaligned sequences chopped into 
words of length k, to reconstruct phylogenetic trees—in particular 
in the context of phylogenomics (phylogenetics at a genomics 
scale) [126, 127]. To the best of our knowledge, nobody has yet
tried to train a neural network or even a deep learning algorithm
[128-130] on a database of phylogenetic trees with
corresponding alignments such as TreeBASE [131] or PANDIT
[132]. As applications of deep learning start emerging in genomics 
[133] and proteomics [134], it is likely that phylogenetics will 
come next.

3 Uncovering Processes and Times

3.1 Dating the Tree of Life: Always Deeper?

Similar to the problem of estimating the tree of life, dating the tree
of life poses many challenges [135]. Since it was first proposed in
1965 [40], the idea of estimating divergence times has undergone
a dramatic change, and new approaches are regularly proposed.
Population geneticists have their own approaches, which are
either fully Bayesian [136] or based on approximate Bayesian
computation in the coalescent framework [137]. All these approaches
make it possible to infer divergence times between recently
diverged species, as in the case of humans and chimpanzees, or to
date demographic events such as the migrations “out of Africa” of
early human populations [138].

In the context of molecular evolution, we are usually interested
in estimating deeper divergence times, such as those between species,
which are available online, for instance, at www.timetree.org
[139], recently revamped and extended to cover close to 100k
species [140]. While early “molecular dates” were systematically
biased towards ages that are too old [135], we argue here that
recent developments in the field have led to more accurate methods
and also to a better understanding of methodological limitations.

3.1.1 The Strict Molecular Clock

One quantity that we can estimate when comparing pairs of
sequences is the number of differences that exist. This number,
estimated as a branch length b, can be corrected for multiple
substitutions (see Subheading 2.7), but basically remains an expected
number of substitutions per site. With “dating” (defined here as the
activity of estimating divergence times [141]), we are interested in
estimating time t, which relates to the expected number of
substitutions b according to the following equation:

$$b = \Delta t \times r \tag{28}$$

where Δt is a period of time and r the rate of evolution. In technical
terms, times and rates are said to be confounded, because we
cannot estimate one without making an assumption about the
other.

The molecular clock hypothesis does just this by assuming that 
rates of evolution are constant in time [40] (see also [142], p. 65). 
Under this assumption, the estimated tree is ultrametric as in the 
basic example represented in Fig. 11, which implies that all the tips 
are level, or equivalently that the distance from root to tip is the 
same for all branches. 

Fig. 11 The strict molecular clock. The tree is ultrametric. The node marked with a star indicates the presence
of a fossil, dated in this example to 10 million years ago (MYA). This is the point that we will use to calibrate the
clock, that is, to estimate the global rate of evolution. The number of substitutions that accumulated from the
marked node to the tips (present) is indicated on the right, at 0.1 substitutions/site. The node that is
the most recent common ancestor of S2 and S5 is the node of interest. The number of substitutions from this
node to the tips is 0.02 substitutions/site

In this example (Fig. 11), the branch length from the fossil-dated
node is 0.1 substitutions/site (sub/site), and the fossil was
estimated to be present 10 million years ago (MYA). Under the
strict molecular clock assumption (equal rates over the whole tree),
we can (1) estimate the rate of evolution (0.1/10 = 0.01
sub/site/my) and (2) date all the other nodes on the tree. For
instance, the most recent common ancestor of S2 and S5 is separated
from the tips by a branch length of 0.02 sub/site. Its divergence
time is therefore 0.02/0.01 = 2 MYA.
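
The two steps of this worked example, calibrating the clock and then dating a node, amount to applying Eq. 28 twice. The short sketch below reproduces the arithmetic; the function names are hypothetical and chosen only for illustration.

```python
def calibrate_rate(branch_length, fossil_age):
    """Global rate (sub/site/my) implied by a fossil-dated node under a strict clock."""
    return branch_length / fossil_age

def date_node(node_depth, rate):
    """Age (in MYA) of a node lying node_depth sub/site above the tips."""
    return node_depth / rate

rate = calibrate_rate(0.1, 10.0)  # 0.1 sub/site accumulated over 10 my
age = date_node(0.02, rate)       # most recent common ancestor of S2 and S5
print(rate, age)
```

Up to floating-point rounding, the printed values are the 0.01 sub/site/my and 2 MYA derived in the text.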

As with any hypothesis, the strict clock can be tested. Tests 
based on relative rates assess whether two species evolve at the same 
rate as a third one, used as an outgroup. Originally formulated in a 
distance-based context [143], likelihood versions have been 
described [44, 144]. However, because of their low power [145] 
their use is on the wane. The most powerful test is again the LRT 
(see Subheading 2.9.1). The test proceeds as usual, first calculating
the test statistic 2Δℓ (twice the difference of log-likelihood values).
The null hypothesis (strict clock) is nested within the alternative
hypothesis (clock not enforced), so that 2Δℓ follows a χ² distribution.
The degree of freedom is calculated following Fig. 12. With
an alignment of n sequences, we can estimate n − 1 divergence
times under the null model (disregarding parameters of the substitution
model) and we have 2n − 3 branch lengths under the
alternative model. The difference in number of free parameters is
therefore n − 2, which is our degree of freedom. This version of
the test actually assesses whether all tips are at the same distance 
from the root of the tree [44]. For time-stamped data, serially 
sampled in time as in the case of viruses, the alternative model 
incorporates information on tip dates [146]. 
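
The likelihood ratio test of the strict clock can be sketched as follows. The log-likelihood values are invented for illustration, the function names are hypothetical, and the χ² tail probability is computed with a small helper based on the standard recurrence of the regularized incomplete gamma function rather than with a statistics library.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of a chi-square distribution with integer df."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    # Starting values: Q(1, x/2) for even df, Q(1/2, x/2) for odd df.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q

def clock_lrt(lnl_clock, lnl_free, n):
    """Test statistic, degrees of freedom (n - 2), and p-value of the clock test."""
    stat = 2.0 * (lnl_free - lnl_clock)
    df = n - 2
    return stat, df, chi2_sf(stat, df)

# Made-up log-likelihoods for an alignment of n = 7 sequences.
stat, df, p = clock_lrt(lnl_clock=-2580.3, lnl_free=-2571.1, n=7)
print(stat, df, p)
```

With n = 7 the test has 5 degrees of freedom, in agreement with the 11 − 6 parameters counted in Fig. 12; a small p-value leads to rejecting the strict clock.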

This linear regression model suggested by the molecular clock
hypothesis has often been portrayed as a recipe [147], which gave
rise in the late twentieth to early twenty-first century to a veritable
cottage industry [148-151], culminating with a paper suggesting
that the age of the tree of life might be older than the age of planet
Earth [152]. This recipe was brought down by two factors: (1) the
publication of a piece written in a rather unusual style for a scientific
paper [153], and (2) new methodological developments. The main
points made in [153] are that (1) most of the early dating studies
relied on one analysis [149] that used a fossil-based calibration
point for the divergence of birds at 310 MYA to estimate a number
of molecular dates for vertebrates, and that (2) these molecular
dates were then used in subsequent studies as a proxy for calibration
points, disregarding their uncertainty. As a result, estimation errors
were passed on and amplified from study to study, leading to the
nonsensical results in [152].

Fig. 12 Testing the strict molecular clock. The divergence times that can be estimated under the strict clock
assumption are denoted $t_i$. The branch lengths that can be estimated without the clock are denoted $b_i$. In the
case depicted, with n = 7 sequences, we have n − 1 = 6 divergence times and 2n − 3 = 11 branch lengths

3.1.2 Local Molecular Clocks

This “debacle” has motivated further theoretical developments in
the dating field. The simplest idea is that, if a global clock does not
hold for the entire tree, then perhaps groups of related species share
the same rate. That is, if a global clock does not hold, perhaps the
tree can be subdivided into local molecular clocks. An initial idea
was proposed in the context of quartets of sequences [154] and was
later generalized to a tree of any size with any number of local
clocks on the tree [155] (constrained by the number of branches on
the tree and calibration points). Because of the arbitrariness of such
local clocks, methods have been devised to place the clocks on the
tree [156] and to estimate the appropriate number of clocks that
should be used [157]. A Bayesian approach now estimates all these
parameters and their placement in an integrated statistical
framework [158].

3.1.3 Correlated Relaxed Clocks

The idea of a correlated relaxed molecular clock goes back to 
Sanderson [159] (see also [160]), who considered that rates of 
evolution can change from branch to branch on a tree. By con- 
straining rates of evolution to vary in an autocorrelated manner on 
a tree, it is possible to devise a method that minimizes the amount 
of rate change. 




The idea of an autocorrelated process governing the evolution 
of the rates of evolution is attributed to [161] in [159], but could 
all the same be attributed to Darwin. Thorne et al. developed this 
idea further in a Bayesian framework [162]. Building upon the 
basic theory covered in Subheading 2.9.3, the idea is to place 
prior distributions on the quantities in the right-hand side of 
Eq. 28. The target distribution is p(t|X). It is proportional to
p(X|t) p(t) according to Bayes’ theorem, but all that we can estimate is

$$\frac{p(X \mid b)\, p(b)}{p(X)} = \frac{p(X \mid r, t)\, p(r, t)}{p(X)} \tag{29}$$

One of the possible ways to expand the joint distribution of rates
and times p(r, t) is p(r|t) p(t), which posits a process where rate
change depends on the length of time separating two divergences. 
The “art” is now in choosing prior distributions, conditional on the 
obvious constraint that rates and times should take positive values. 
A number of such prior distributions for rates have been proposed 
and assessed [163] and one of the best-performing models for rates
is, in our experience, the log-normal model [162, 164]. The prior
on times is either a pure-birth (Yule) model or a birth-and-death
process possibly incorporating species sampling effects [165]. If
sequences are sampled at the population level, a coalescent process
is more appropriate (see [112] for an introduction). In this case, the
past demography of the sampled sequences can be traced back
taking inspiration from spline regression techniques [166, 167] or
multiple change-point models [168].
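
To make the idea of an autocorrelated log-normal prior on rates more concrete, the sketch below draws rates along a single root-to-tip lineage. It assumes one common parameterization, in which the logarithm of a branch rate is normally distributed around the logarithm of its parent's rate with a variance proportional to the branch duration; the exact form used in [162] may differ, and the names and the parameter nu are hypothetical.

```python
import math
import random

def sample_autocorrelated_rates(root_rate, durations, nu, rng):
    """Draw one rate per branch along a lineage, each conditional on its parent:
    log(rate) ~ Normal(log(parent rate), variance = nu * branch duration)."""
    rates = []
    parent = root_rate
    for dt in durations:
        sd = math.sqrt(nu * dt)
        rate = math.exp(rng.gauss(math.log(parent), sd))  # always positive
        rates.append(rate)
        parent = rate
    return rates

rng = random.Random(2024)
durations = [5.0, 3.0, 2.0]  # branch durations, e.g. in million years
rates = sample_autocorrelated_rates(0.01, durations, nu=0.1, rng=rng)
# Eq. 28 converts each (rate, duration) pair into a branch length.
branch_lengths = [r * dt for r, dt in zip(rates, durations)]
print(rates, branch_lengths)
```

Setting nu to zero recovers the strict clock, since every branch then inherits the rate at the root, whereas larger values allow rates to drift further apart over long branches.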

Once these priors are specified, an MCMC sampler will draw
from the target distribution in Eq. 29, and marginal distributions
for times and rates can easily be obtained. The rationale behind the
sampler is represented in Fig. 13. As per Eq. 28, the relationship
between rates and times is the branch of a hyperbolic curve, where
the priors on rates and on times define a region of higher posterior
probability, symbolized here by a contour plot superimposed on the
hyperbolic curve. On top of this, fossil information is incorporated
into the analysis as constraints on times. A very influential piece
stimulated a discussion about the shape of these prior distributions
[153], which was taken up [169], and further developed in [170].
Briefly, fossil information is usually imprecise, as paleontologists can
only provide minimum and maximum ages (Fig. 13). Of these two
ages, the minimum age is often the most reliable. Under the
assumption that the placement of the fossil on the tree is correct,
the idea is to place on fossil dates a prior distribution that will be
highly skewed towards older (maximum) ages. A “hard bound” can
be placed on the minimum age, possibly by shifting this prior
distribution by an offset equal to the minimum age, while the
tails of the prior distribution will act as “soft bounds,” because
they do not impose on the tree a strict (or hard) constraint. Empirical
studies agree, however, that both reliability and precision of
fossil calibrations are critical to estimating divergence times
[136, 171].

Fig. 13 The relaxed molecular clock. See text for details

3.1.4 Uncorrelated Relaxed Clocks

Because of the autocorrelation between the rate of each branch and
that of its ancestral branch (except for the root, which obviously
requires a special treatment), the tree topology is fixed under the
autocorrelated models described above. By relaxing this assumption
about rate autocorrelation, [172] were able to implement a
model that also integrates over topological uncertainty. In spite of
the somewhat counter-intuitive nature of the relaxation of the
autocorrelated process, as implemented in BEAST [91, 173], empirical
studies have found this approach to be one of the best-performing
(e.g., [157]).

When first published, it was proposed that making use of an
uncorrelated relaxed molecular clock could improve phylogenetic
inference [172]. The idea was that calibration points and their
placement on the tree could act as additional information. However,
a simulation study suggests that relaxed molecular clocks
might not improve phylogenetic accuracy [174], a result that
might be due to the lack of calibration constraints in this particular
simulation study.

3.1.5 Some Applications of Relaxed Clock Models

Since the advent of relaxed molecular clocks, two very exciting 
developments have seen the light of day. The first concerns the 
inclusion of spatial statistics into dating models [175, 176]. Spatial 
statistics are not new in population genetics [177] and have been 
used with success in combination with analyses in computational 
molecular evolution (e.g., [178]). However, the originality in 
[176], for instance, is to combine in a single statistical framework
molecular data with geographical and environmental information
to infer the diffusion of sequences through both space and time. 
While these preliminary models seem to deal appropriately with 
natural barriers to gene flow such as coastlines, a more detailed set 
of constraints on gene flow may further enhance their current 
predictive power. 

The second development coming from relaxed molecular 
clocks concerns the mapping of ancestral characters onto uncertain 
phylogenies. This is not a novel topic, as a Bayesian approach was 
first described in 2004 [179, 180]. The novelty is that we now have 
the tools to correlate morphological and molecular evolution in 
terms of their absolute rates and to allow both molecular and 
morphological rates of evolution to vary in time [181]. Further 
development will certainly integrate over topological uncertainty. 
While there has been a heated controversy about the existence of 
such a correlation in the past [182], all previous studies were using 
branch length as a proxy for rate of molecular evolution, which is 
clearly incorrect. We can therefore expect some more accurate 
results on this topic very soon. More details and examples can be 
found in recent and extensive reviews [183-185] that further dis- 
cuss applications to biogeographic studies [186], or extensions to 
viral [187, 188], as well as other types of genomic [189] and 
morphological [190] data. 


4 Molecular Population Phylogenomics 


Population genetics is rich in theory regarding the relative roles of 
mutation, drift, and selection. Much research in population geno- 
mics is now focusing on using this theory to develop statistical 
procedures to infer past processes based on population-level data, 
such as those of the 1000-genome project [191], the UK’s 10,000 
genome project [192], and ever more ambitious projects [193].
One limitation of these inference procedures is that they all focus 
on a thin slice of evolutionary time by studying evolution at the 
level of populations. If we wish to study longer evolutionary time 
scales, for example, tens or hundreds of millions of years, we must 
resort to interspecific data. In such a context, which is becoming 
intrinsically phylogenetic, the most important event is a substitution, 
that is, a mutation that has been fixed. Yet substitution rates can be 
defined from several features. In particular, from a population 
genetics perspective, it is of interest to model both mutational 
features and selective effects, combining them multiplicatively to 
specify substitution rates. We review briefly how substitution mod- 
els that invoke codons as the state space lend themselves naturally to 
these objectives in a first section below (Subheading 4.1), before 
explaining the origin (and a shortcoming) of all the approaches 
developed so far (Subheading 4.2). 
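
The multiplicative construction of substitution rates on a codon state space can be sketched as follows. This is a simplified illustration, not the exact parameterization of any cited model: the mutational layer is assumed to be HKY-like (a transition/transversion ratio kappa and target nucleotide frequencies pi) rather than a full GTR matrix, selection is reduced to a single multiplier omega applied to nonsynonymous changes, and all names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE_CODONS = [codon for codon, aa in GENETIC_CODE.items() if aa != "*"]
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def substitution_rate(codon_i, codon_j, kappa, omega, pi):
    """Rate from codon_i to codon_j: mutational term times selective term."""
    diffs = [(a, b) for a, b in zip(codon_i, codon_j) if a != b]
    if len(diffs) != 1:
        return 0.0  # point-mutation process: one nucleotide change at a time
    a, b = diffs[0]
    mutation = pi[b] * (kappa if (a, b) in TRANSITIONS else 1.0)
    synonymous = GENETIC_CODE[codon_i] == GENETIC_CODE[codon_j]
    return mutation * (1.0 if synonymous else omega)

pi = {"T": 0.25, "C": 0.25, "A": 0.25, "G": 0.25}
print(len(SENSE_CODONS))
print(substitution_rate("TTT", "TTC", kappa=2.0, omega=0.2, pi=pi))  # synonymous
print(substitution_rate("TTT", "TTA", kappa=2.0, omega=0.2, pi=pi))  # nonsynonymous
```

Replacing the single multiplier omega with a term that depends on the amino acids or codons involved leads to the mutation-selection models discussed in this section.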


4.1 Bridging the Gap Between Population Genetics and Phylogenetics

Assuming a point-mutation process, such that events only change
one nucleotide of a codon during a small time interval, Muse and
Gaut proposed a codon substitution model with rates specified
from the $Q_{GTR}$ nucleotide-level matrix (see Subheading 2.7),
along with one parameter that modulates synonymous events and
another one that modulates nonsynonymous events [194]. In most
subsequent formulations, the parameter associated with synonymous
events is assumed to be fixed, such that the model only
modulates nonsynonymous rates by means of a parameter denoted
ω. This parameter has traditionally been interpreted as the
nonsynonymous to synonymous rate ratio, and is generally associated
with a different formulation of the codon model proposed by
Goldman and Yang [195]. More details on codon models can be
found in Chapter 4.1 [196]. There continues to be a debate
regarding the interpretation of the ω parameter [197, 198]. Regardless
of how this issue is settled, it is clear that ω is aimed at capturing
the net overall effects of selection, irrespective of the exact nature of
these effects.

With the intention to model selective effects themselves, Halpern
and Bruno proposed a codon substitution model that combines
a nucleotide-level layer, as described above, for controlling
mutational features, along with a fixation factor that is proportional
to the fixation probability of the mutational event [199]. The
fixation factor is in turn specified from an account of amino acid
or codon preferences. One objective of the model, then, consists in
teasing apart mutation and selection. While [199] proposed their
model with site-specific fixation factors, later work has explored
simpler specifications, where all sites have the same fixation factor
[200]. Other models that aimed at capturing across-site heterogeneities
in fixation factors were proposed using nonparametric
devices and empirical mixtures [201]. Another core idea behind
these approaches is to construct a more appropriate null model
against which to test for features of the evolutionary process. This
idea has been put into practice for the detection of adaptive evolution
in protein-coding genes [202, 203]. Recent developments
include sequence-wide fixation factors [9, 197, 204, 205], and we
predict that these models will play a role in bridging the gap
between molecular evolution at the population and at the species
levels.

4.2 Origin of Mutation-Selection Models: The Genic Selection Model

In order to understand a shortcoming of these models, we need to
go back to the development of fixation probabilities that took place
in the second half of the twentieth century. The basic unit or
quantum of evolution is a change in allele frequency p. Allele
frequencies can be affected by four processes: migration, mutation,
selection, and drift. Because of the symmetry between migration
and mutation [206], which only differ in their magnitude, these
two processes can be treated as one. We are left with three forces:
mutation, selection, and drift. The question is then, what is the fate
of an allele under the combined action of these processes? Our
development here follows [207] (but see [208] for a very clear
account).

4.2.1 Fixation Probabilities

Of the three processes affecting allele frequencies, mutation and 
selection can be seen as directional forces in that their action will 
shift the distribution of allele frequencies towards a particular point, 
be it an internal equilibrium, or fixation/loss of an allele. On the 
other hand, drift is a non-directional process that will increase the 
variance in allele frequencies across populations, and will therefore 
spread out the distribution of allele frequencies. This distribution is
denoted Ψ(p, t). We also must assume that the magnitude of all
three processes, mutation, selection, and drift, is small and of the
order of $1/(2N_e)$, where $N_e$ is the effective population size. To derive the
fate of an allele after a certain number of generations, we also need
to define g(p, ε; dt), the probability that allele frequency changes
from p to p + ε during a time interval dt.

In phylogenetics (and population genetics) we are generally
interested in predicting the past. The tool making this possible is
called the Kolmogorov backward equation, which predicts the
frequency of an allele at some time t, given its frequency $p_0$ at
time $t_0$:

$$\Psi(p, t + dt \mid p_0) = \int \Psi(p, t \mid p_0 + \varepsilon)\, g(p_0, \varepsilon; dt)\, d\varepsilon \tag{30}$$

We can take the Taylor expansion of Eq. 30 around $p_0$, neglect all
terms whose order is larger than two ($o(p_0^2)$) and since Ψ is not a
function of ε, we obtain:

$$\Psi(p, t + dt \mid p_0) = \Psi \int g\, d\varepsilon + \frac{\partial \Psi}{\partial p} \int \varepsilon\, g\, d\varepsilon + \frac{1}{2} \frac{\partial^2 \Psi}{\partial p^2} \int \varepsilon^2 g\, d\varepsilon \tag{31}$$

This formulation leads to the definition of two terms that represent
the directional processes affecting allele frequencies (M) and the
non-directional process, or drift (V):

$$M(p)\, dt = \int \varepsilon\, g\, d\varepsilon, \qquad V(p)\, dt = \int \varepsilon^2 g\, d\varepsilon \tag{32}$$


that we can substitute into Eq. 31. At equilibrium, $\partial \Psi / \partial t = 0$ and, after
a bit of calculus, we obtain:

$$\frac{\partial \Psi}{\partial p} = C\, e^{-\int \frac{2M}{V}\, dp} \tag{33}$$

for which we need to specify boundary conditions and a model of
selection. The boundary conditions are the two absorbing states of
the system: (1) once fixed, an allele remains fixed (Ψ(1, ∞; 1) = 1)
and (2) once lost, an allele remains lost (Ψ(1, ∞; 0) = 0). With
these two requirements, the probability that the allele frequency is
1 given that it was $p_0$ in the distant past is the fixation probability:

$$\Psi(1, \infty; p_0) = \frac{\int_0^{p_0} e^{-\int \frac{2M}{V}\, dp}\, dp}{\int_0^{1} e^{-\int \frac{2M}{V}\, dp}\, dp} \tag{34}$$

We therefore only need to compute M and V under a particular
model of selection to fully specify the fixation probability of an allele
in a mutation-selection-drift system. All that is required now to go
further is a selection model.

Table 3
The standard selection models

Selection coefficients       A1A1          A1A2           A2A2
Genic (positive) selection   w1 = 1 + s    w2 = 1 + hs    w3 = 1
Overdominance                w1 = 1        w2 = 1 + s     w3 = 1

Models are represented for one locus with two alleles, A1 and A2, which define three
genotypes A1A1, A1A2, and A2A2 of fitness w1, w2, and w3. The selection coefficient is
s (positive in this table, but not necessarily so) and the dominance is governed by
h (h ∈ [0, 1])

4.2.2 The Case of Genic Selection

We are now ready to derive an explicit form for Ψ(1, ∞; $p_0$) in
Eq. 34 in the case of the genic selection model (Table 3; [209]).
With q = 1 − p, the mean fitness is:

$$\bar{w} = 1 + s p^2 + 2pqhs = 1 + 2phs + s p^2 (1 - 2h) \tag{35}$$

which can be approximated by 1 + 2phs (the result is exact only
when h = 1/2). Therefore, $d\bar{w}/dp = 2hs$, and we can calculate the
M and V terms to obtain the popular result:

$$\Psi(1, \infty; p_0) = \frac{\int_0^{p_0} e^{-\int \frac{2M}{V}\, dp}\, dp}{\int_0^{1} e^{-\int \frac{2M}{V}\, dp}\, dp} = \frac{e^{-4 N_e h s p_0} - 1}{e^{-4 N_e h s} - 1} \tag{36}$$
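
The passage from Eq. 34 to Eq. 36 can be made explicit. The following is a sketch that assumes the usual diffusion expressions under genic selection, $M = hs\,p(1-p)$ for the directional term and $V = p(1-p)/(2N_e)$ for drift; neither expression is spelled out in the text, so both should be read as assumptions:

$$\frac{2M}{V} = \frac{2hs\,p(1-p)}{p(1-p)/(2N_e)} = 4N_e hs, \qquad \int \frac{2M}{V}\,dp = 4N_e hs\,p$$

$$\int_0^{x} e^{-4N_e hs\,p}\,dp = \frac{1 - e^{-4N_e hs\,x}}{4N_e hs}$$

Taking the ratio of this integral evaluated at $x = p_0$ and at $x = 1$ cancels the factor $4N_e hs$ and gives the right-hand side of Eq. 36.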


Now, the initial frequency of a mutation in a diploid population
of (census) size N is $p_0 = 1/(2N)$ (following [208]; [207] considered
that $p_0 = 1/(2N_e)$; this debate is beyond the scope of this
chapter), which leads to:

$$\Psi\left(1, \infty; \frac{1}{2N}\right) = \frac{e^{-2 N_e h s / N} - 1}{e^{-4 N_e h s} - 1} \tag{37}$$

If $N_e$ is of the order of N, the numerator of the right-hand side of
Eq. 37 becomes approximately $e^{-2hs} - 1$, whose Taylor approximation
around hs = 0 is simply −2hs. We then obtain the result used
in [199], and in all the papers that implemented mutation-selection
(-drift) models (e.g., [197, 199-201, 204]):

$$\Psi\left(1, \infty; \frac{1}{2N}\right) \approx \frac{2hs}{1 - e^{-4 N_e h s}} \tag{38}$$

Two critical points should be noted here. First, none of the 
recent codon models [197, 199-202, 204, 210, 211] ever investigated
the role of dominance h, as they all consider that the allele
under (positive) selection is fully dominant. Second, Table 3 shows 
that another class of selection models, those based on balancing 
selection, has never been considered so far. The impact of the 
selection model on the predictions made by the mutation-selection 
(-drift) models is currently unknown. 


5 High-Performance Computing for Phylogenetics 


5.1 Parallelization 


5.2 HPC and Cloud 
Computing 


Because of the dependency of the likelihood computations on the 
shape of a particular tree (see Subheading 2.6), most phylogenetic 
computations cannot be parallelized to take advantage of a multi- 
processor (or multicore) environment. Nevertheless, two main 
directions have been explored to speed up computations: first, in 
computing the likelihood of substitution models that incorporate 
among-site rate variation and second, in distributing bootstrap 
replicates to several processors, as both types of computations 
can be done independently. A third route is explored in 
Chapter 7.4 [212]. 

In the first case, among-site rate variation is usually modeled 
with a Γ distribution [213] that is discretized over a finite (and
small) number of categories [214]. The likelihood then takes the 
form of a weighted sum of likelihood functions, one for each 
discrete rate category, so that each of these functions can be eval- 
uated independently. The route most commonly used is the plain 
“embarrassingly parallel” solution, where completely independent 
computations are farmed out to different processors. Such is the 
case for bootstrap replicates, for which a version of PhyML [24] 
exists, or in a Bayesian context for independent MCMC samplers 
[215] (see Subheading 2.9.3). The PhyloBayes-MPI package imple- 
ments distributed likelihood calculations across sites over several 
compute-cores, allowing for a genuinely parallelized MCMC run 
[216, 217]. 
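
The independence of the rate categories can be illustrated with a toy example. The sketch below makes several simplifying assumptions that are not taken from any of the cited programs: the "tree" is a single branch joining two sequences, the substitution model is Jukes-Cantor, the four category rates are fixed illustrative values with mean 1 instead of being derived from an estimated shape parameter, and each category is handed to a worker thread.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Illustrative rates for four equally weighted categories (mean rate = 1).
CATEGORY_RATES = [0.0334, 0.2519, 0.8203, 2.8944]

def jc69_site_likelihood(same, t):
    """Likelihood of one site of a two-sequence alignment under Jukes-Cantor,
    for sequences separated by t expected substitutions per site."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e
    p_diff = 0.25 - 0.25 * e
    return 0.25 * (p_same if same else p_diff)

def category_likelihoods(rate, sites, t):
    """Site likelihoods for one rate category; independent of the other categories."""
    return [jc69_site_likelihood(same, rate * t) for same in sites]

def log_likelihood(sites, t, rates=CATEGORY_RATES, workers=4):
    """Weighted sum over categories, each category evaluated by its own worker."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_category = list(pool.map(lambda r: category_likelihoods(r, sites, t), rates))
    weight = 1.0 / len(rates)
    return sum(
        math.log(sum(weight * cat[i] for cat in per_category))
        for i in range(len(sites))
    )

# True marks a site where the two sequences are identical.
sites = [True] * 8 + [False] * 2
print(log_likelihood(sites, t=0.2))
```

Threads are used here only to show the structure of the computation. In CPython, pure-Python arithmetic does not run faster across threads, so an actual speed-up would require separate processes or compiled likelihood code.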


More recent work has focused on the development of heuristics
that make large-scale phylogenetics amenable to high-performance
computing (HPC) performed on computer clusters.
Because of the algorithmic complexity of resolving phylogenetic
trees, an approach based on “algorithmic engineering” was devel- 
oped [218]. The underlying idea is akin to the training phase in 
supervised machine learning [123], except that here the target is 
not the performance of a classifier but that of search heuristics. All 
of these heuristics reuse parameter estimates, avoid the computa- 
tion of the full likelihood function for all the bootstrap replicates, or 
seed the search algorithm for every n-th replicate on the results of
previous replicates [218]. For instance, in the “lazy subtree rear- 
rangement” [219], topologies are modified by SPR (see Subhead- 
ing 2.10.2), but instead of recomputing the likelihood on the 
whole tree, only the branch lengths around the perturbation are 
re-optimized. This approximation is used to rank candidate topol- 
ogies, and the actual likelihood is evaluated on the complete tree 
only for the best candidates. These heuristics now permit the 
analysis of thousands of sequences in a probabilistic framework 
[220], but the actual convergence of these algorithms remains 
difficult to evaluate, especially on very large data sets (e.g., >10⁴
sequences). 
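
The logic of such "lazy" heuristics, ranking candidates with a cheap approximation and reserving the expensive computation for the most promising ones, can be sketched generically. The two scoring functions below are hypothetical stand-ins for a local branch-length re-optimization and a full-tree likelihood evaluation, and the candidates are plain numbers rather than topologies.

```python
def two_stage_search(candidates, cheap_score, full_score, keep=3):
    """Rank all candidates with an inexpensive approximate score, then
    compute the expensive score only for the `keep` best-ranked ones."""
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:keep]
    return max(shortlist, key=full_score)

calls = []

def approximate(x):
    # Hypothetical stand-in for re-optimizing a few branches around the move.
    return -abs(x - 6)

def exact(x):
    # Hypothetical stand-in for the likelihood of the complete tree.
    calls.append(x)
    return -(x - 6.2) ** 2

best = two_stage_search(range(10), approximate, exact)
print(best, len(calls))  # prints "6 3"
```

Only three of the ten candidates are scored with the expensive function. The price to pay is that the true optimum can be missed whenever the approximation ranks it outside the shortlist, which is one reason why convergence is hard to assess.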

In addition to the reduction of the memory footprint for sparse 
data matrices [221], an alternative direction to “tweaking likeli- 
hood algorithms” has been to take direct advantage of the comput- 
ing architecture available. One particular effort aims at tapping 
directly into the computing power of graphics processing units or 
GPUs, taking advantage of their shared common memory, their 
highly parallelized architecture, and the comparatively negligible 
cost of spawning and destroying threads on them. As a result, it is 
possible to distribute some of the summation entering the pruning 
algorithm (see Subheading 2.6) to different GPUs [222]. The num- 
ber of programs taking advantage of these developments is widen- 
ing and includes popular options such as BEAST [91] and 
MrBayes [223]. 

All these fast algorithms can be installed on a local computer 
cluster, a solution adopted by many research groups since the late 
1990s. However, installing a cluster can be demanding and costly 
because a dedicated room is required with appropriate cooling and 
power supply (not to mention securing the room, physically). 
Besides, redundancy requirements, both in terms of power supply 
and data storage, as well as basic software maintenance and user 
management, may demand hiring a system administrator. An alter- 
native is to run analyses on a remote HPC server, in the “cloud.” 
Canada, for instance, has a number of such facilities, thanks to 
national funding bodies (CAC at cac.queensu.ca, SHARCNET at 
www.sharcnet.ca, or Calcul Quebec at www.calculquebec.ca, just to 
name a few), and commercial solutions are just a few clicks away 
(e.g., Amazon Elastic Compute Cloud or EC2). Researchers can 
obtain access to these HPC solutions according to a number of 
business models (free, on demand, yearly subscription, etc.) that are
associated with a wide spectrum of costs [224]. But in spite of the
technical support included in the price, users usually still have to
install their preferred phylogenetic software manually or submit a
formal request to the team of system administrators managing the
HPC facility, neither of which is always convenient.

To make the algorithmic and technological developments 
described above more accessible, the recent past has seen the emer- 
gence of cloud computing [225] dedicated to the phylogenetics 
community. Examples include the CIPRES Science Gateway (www. 
phylo.org), or Phylogeny.fr (www.phylogeny.fr, [226]). Many 
include web portals that do not require that users be well versed 
in Unix commands, while others may include an application
programming interface to cater to the most computer-savvy users.
Potential limitations of these services are the bandwidth necessary
to transfer large files and the storage requirements, especially in
the context of next-generation sequencing data. The management of
relatively large files will remain a potential issue, unless phyloge- 
netics practitioners are ready to discard these files after analysis, the 
end product of which is a single tree file a few kilobytes in size, in 
the same way that people involved in genome projects delete the 
original image files produced by massively parallel sequencers. Data 
security or privacy might not be a problem in most applications, 
except in projects dealing with human subjects or viruses such as 
HIV that expose the sexual practices of subjects. However, once 
these various hurdles are out of the way, users could very well 
imagine running their phylogenetic analyses with millions of 
sequences from their smartphone while commuting. 


Although most of the initial applications of likelihood-based meth- 
ods were motivated by the shortcomings of parsimony, they have 
now become well accepted as they constitute principled inference 
approaches that rely on probabilistic logic. Moreover, they allow 
biologists to evaluate more rigorously the relative importance of 
different aspects of evolution. The models presented in this chapter 
have the ability to disentangle rates from times (Subheading 3), or 
mutation from selection (Subheading 4), while in most cases 
accounting for the uncertainty about nuisance parameters. But 
the latest developments described above still make a number of 
restrictive assumptions (Subheading 4.2), and while many varia- 
tions in model formulations can be envisaged, they still remain to 
be explored in practice. 

Although some progress has been made in developing integra- 
tive approaches (e.g., [176, 181]), throughout this chapter we have 
assumed that a reliable alignment was available as a starting point. A 
number of methods exist to co-estimate an alignment and a
phylogenetic tree (see Part I of this book), but the computational
requirements and convergence of some of these approaches can be 
daunting, even on the smallest data sets by today’s standards. 

This brings us, finally, to the issue of tractability of most of 
these models in the face of very large data sets. The field of
phylogenomics is developing quickly (see Part III), at a pace that is ever
increasing given the output rate of whole genome sequencing 
projects. Environmental questions are drawing more and more 
attention, and metagenomes (see Part VI) will be analyzed in the 
context of what will soon be called metaphylogenomics. Exploring 
the numerous available and foreseeable substitution models in such 
contexts will require continued work in computational
methodologies. As such, modeling efforts will continue to go hand in
hand with, and may depend on, algorithmic developments [227]. It is
also possible that, in the near future, the use of likelihood-free
approaches such as ABC or machine learning algorithms in
computational molecular evolution will be more thoroughly explored.


Abstract


Whole-genome alignment (WGA) is the prediction of evolutionary relationships at the nucleotide level 
between two or more genomes. It combines aspects of both colinear sequence alignment and gene 
orthology prediction and is typically more challenging to address than either of these tasks due to the 
size and complexity of whole genomes. Despite the difficulty of this problem, numerous methods have been 
developed for its solution because WGAs are valuable for genome-wide analyses such as phylogenetic 
inference, genome annotation, and function prediction. In this chapter, we discuss the meaning and 
significance of WGA and present an overview of the methods that address it. We also examine the problem 
of evaluating whole-genome aligners and offer a set of methodological challenges that need to be tackled in 
order to make most effective use of our rapidly growing databases of whole genomes. 


Key words Sequence alignment, Whole-genome alignment, Homology map, Toporthology, 
Genome evolution, Comparative genomics 


1 Introduction 


When the problem of biological sequence alignment was first 
described and addressed in the 1970s, sequencing technology was 
limited to obtaining the sequences of individual proteins or 
mRNAs or short genomic intervals. As such, classical sequence 
alignment (as described in Chapter 7 [1]) is typically focused on 
predicting homologous positions within two or more relatively 
short and colinear sequences, allowing for the edit events of substi- 
tution, insertion, and deletion. Although limited in its scope, this 
type of alignment remains extremely important today, with gene- 
sized alignments forming the basis of most evolutionary studies. 
Starting in 1995 with the sequencing of the 1.8-Mb genome
of the bacterium Haemophilus influenzae [2], biologists have had
access to a different scale of biological sequences, those of whole 
genomes. DNA sequencing technology has rapidly improved since 
that time, and as a result, we have seen an explosion in the availabil- 
ity of whole-genome sequences. As of the writing of this chapter, 
there are 9071 published complete genome sequences (8380
bacterial, 281 archaeal, and 410 eukaryotic), according to the
GOLD database [3]. Whole-genome sequencing remains popular, 
with over 140,000 sequencing projects that are either ongoing or 
completed. 

Along with the ascertainment of these sequences, the problem 
of whole-genome alignment (WGA) has arisen. As each genome is 
sequenced, there is interest in aligning it against other available 
genomes in order to better understand its evolutionary history and, 
ultimately, the biology of its species. Like classical sequence align- 
ment, WGA is about predicting evolutionarily related sequence 
positions. However, aligning whole genomes is made more com- 
plicated by the fact that genomes undergo large-scale structural 
changes, such as duplications and rearrangements. In addition, a 
set of genomes may contain pairs of sequence positions whose 
evolutionary relationships can be described by any of the three 
major subclasses of homology: orthology, paralogy, and xenology. 
As orthologous positions are typically of primary interest, WGA 
also involves the classification of homologous relationships. 

In this chapter, we describe the problem of WGA and the 
methods that address it. We begin with a thorough definition of 
the problem and discuss the important downstream applications of 
WGAs. We then categorize the WGA methods that have been 
developed and describe the key computational techniques that are 
used within each category. In addition to describing whole-genome 
aligners, we also discuss the various approaches that have been used 
for evaluating the alignments they produce. Lastly, we lay out a 
number of current methodological challenges for WGA. 


2 The Definition and Significance of WGA 


2.1 WGA as a Correspondence Between Genomes


In imprecise terms, a WGA is a “correspondence” between gen- 
omes. For each segment of a given genome, a WGA tells us where 
its “corresponding” segments are in other genomes. A segment 
may be one or more contiguous nucleotide positions within a 
genome. What does it mean for two genomic segments to “corre- 
spond” to each other? In most situations, we consider two seg- 
ments to be “corresponding” if they are orthologous. Orthologous 
sequences are those that are evolutionarily related (homologous) 
and that diverged from their most recent common ancestor 
(MRCA) due to a speciation event [4]. In contrast, paralogous 
sequences are homologs that diverged from the MRCA due to a 
duplication event. Thus, by definition, orthologous sequences are 
the most closely related pieces of two genomes and, as is more 
thoroughly discussed later and in Chapter 9 [5], are of primary 
interest because they are useful for applications such as function 
prediction and species tree inference. As such, WGA is most
commonly taken to be the prediction of orthology between the
components of entire genome sequences. When a WGA also
predicts paralogy, typically only paralogs whose MRCA is at least as
recent as the MRCA of the entire set of genomes are considered, as
there is extensive ancient homology within extant genomes.

It is important to note that the orthologous relationships
between two genomes do not create a one-to-one correspondence.
Duplication events that have occurred since the time of the MRCA
of the species can result in a genomic segment in one species having
multiple orthologous segments in another. This is a particularly
important issue when the genome of one lineage has undergone a
whole-genome duplication event since the time of the MRCA. In
this situation, few segments of the genome of the nonduplicated
lineage have a single ortholog in the other genome.


2.2 Toporthology


In many cases, WGAs do not aim to predict all orthologous 
sequences. Instead, they only predict toporthology (positional 
orthology), a distinguished subset of orthology [6, 7]. The concept 
of toporthology captures the notion that not all orthologous rela- 
tionships are equivalent in terms of the evolutionary history of the 
genomic context of the orthologs. Figure 1 gives an example 
scenario in which toporthology helps to distinguish between two 
orthologous relationships. 

The definition of toporthology relies on a classification of 
duplication events. A duplication event is considered to be “sym- 
metric” if the removal of either copy of the duplicated genomic 
material (immediately after the event) reverts the genome to its 
original (preduplication) state. Examples of symmetric duplications 
are tandem and whole-genome duplications. If only one specific 
copy can be removed to undo a duplication event, then the event is 
considered “asymmetric.” In the asymmetric case, the removable 
copy is referred to as the “target,” with the other copy referred to as 
the “source.” Retrotransposition and segmental duplication both 
belong to the asymmetric class. 

With this classification of duplication events in hand, we can 
now define toporthology. Two genomic segments are toportholo- 
gous if they are orthologous and neither segment is derived from 
the target of an asymmetric duplication event since the time of the 
MRCA of the segments. Thus, two orthologous segments are 
toporthologous if their evolutionary history (since the MRCA) 
only involves symmetric duplication events or asymmetric duplica- 
tions in which their ancestral segment was part of the source copy. 
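
The definition above can be read as a simple rule over the duplication history of each segment. The following Python sketch is a minimal illustration under stated assumptions: the `Duplication` record, the `is_toporthologous` function, and the encoding of each lineage as a list of duplication events since the MRCA are all hypothetical and are not taken from any WGA tool. The example histories assume, as the caption of Fig. 1 suggests, that YB1 descends from the source copy and YB2 from the target copy.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Duplication:
    """One duplication event on the lineage leading to a segment (hypothetical record)."""
    symmetric: bool            # True for tandem or whole-genome duplications
    from_target: bool = False  # asymmetric events only: segment descends from the target copy


def is_toporthologous(orthologous: bool,
                      history_a: List[Duplication],
                      history_b: List[Duplication]) -> bool:
    """Orthologous, and neither segment derives from the target of an
    asymmetric duplication since the MRCA of the two segments."""
    if not orthologous:
        return False
    for event in history_a + history_b:
        if not event.symmetric and event.from_target:
            return False
    return True


# Histories since the MRCA (assumed encoding of the scenario in Fig. 1).
ya = []
yb1 = [Duplication(symmetric=False, from_target=False)]  # source copy
yb2 = [Duplication(symmetric=False, from_target=True)]   # target copy

print(is_toporthologous(True, ya, yb1))  # True
print(is_toporthologous(True, ya, yb2))  # False
```

Symmetric events never disqualify a pair in this sketch, which mirrors the statement that tandem and whole-genome duplications preserve toporthology.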

The important property of toporthologs is that, in the absence 
of rearrangement events, they share the same ancestral genomic 
context. As the context of a gene or genomic segment has func- 
tional consequences, toporthologous sequences are generally 
expected to be more similar in their function than orthologous 
sequences that are not toporthologous (atoporthologs) [6].
However, there is no guarantee that toporthologs share a common
function or that two genomic intervals that have the same function
are toporthologs. Thus, a rigorous functional analysis of genomes
should consider all classes of homology. Nevertheless, WGAs that
focus on toporthology produce a good first approximation to a
functional correspondence between genomes.

[Figure 1: diagram of an Ancestor genome and its descendants Species A and Species B, with an asymmetric duplication]

Fig. 1 A hypothetical evolutionary scenario in which the relation of toporthology distinguishes between two
ortholog pairs. The bullet-like shapes indicate genomic segments. Both YB1 and YB2 are orthologous to
YA. However, only YB1 is toporthologous to YA because YB2 was derived from the target of an asymmetric
duplication since the time of the most recent common ancestor, Y, of YB2 and YA


2.3 Definition and Representation


To be more precise, a WGA is, in general, the prediction of homol- 
ogous pairs of positions between two or more genome sequences. 
Often, as we have previously discussed, only orthologous or 
toporthologous relations are predicted in WGAs. And while align- 
ment is typically focused on homologous relationships between 
sequences, whole-genome comparisons can also include alignments 
within genomes, which represent paralogous sequences. 

Note that we define WGA as homology prediction at the level 
of nucleotides. Although the concept of homology is more com- 
monly used with respect to entire genes or proteins, it is easily used 
and, in fact, more naturally defined at the level of single nucleotides. 
Homology of nucleotide positions is established through template- 
driven nucleotide synthesis, and the definitions of orthology, paral- 
ogy, and xenology for nucleotides follow those for genes [7]. 

While a WGA can be defined as a prediction of homology 
statements, it is usually represented as a set of nucleotide-level 
alignment matrices or “blocks,” each block made up by segments 
of the genomes that are both homologous and colinear. Homolo- 
gous genomic segments are colinear if they have not been broken 
by a rearrangement event since the time of their MRCA. Since 
rearrangement events, such as inversions, are common at the scale 
of entire genomes, WGAs are typically made up of many blocks. In 
general, a block contains two or more genomic segments, and 
multiple segments in the same block may belong to the same 
genome (indicating paralogous sequence). One specific WGA
representation, the “threaded blockset” [8], requires that every
position belongs to a block and thus additionally allows a block to
contain just a single segment, which would represent a unique
genomic sequence. Figure 2 depicts a hypothetical example of a
WGA, with some blocks containing both orthologous and
paralogous sequences.

[Figure 2: diagram of the genomes of Species A, Species B, and Species C, the alignment blocks, and the following slice of block Z]

    A   TT-CTAAGTG
    B   CTACTAAG-G
    C1  CTACT--GTG
    C2  CTACC--GTG

Fig. 2 An example WGA of three genomes represented as a set of alignment blocks. (a) The positions of the
genomic segments that are in the alignment blocks are shown as shaded bullet-like shapes (the direction of
the bullet indicates the orientation of the segment). In this example, not all genomic segments belong to a
block (note the unshaded intervals). (b) The alignment blocks of the WGA. Note that blocks do not need to
contain a segment from all genomes (e.g., block Y) and that some blocks can contain multiple segments from
the same genome (e.g., blocks X and Z). (c) A slice of alignment block Z, which is a nucleotide-level alignment

As more genomes are added to an alignment or the total
evolutionary divergence between them is increased, the blocks in
a WGA decrease in size and increase in number. One might imagine
that in the limit of an infinite number of genomes or an infinite
amount of time, all blocks might have length one (a single column),
which makes the concept of an “alignment matrix” irrelevant.
However, rearrangements in certain segments of the genome are
likely to be highly deleterious to an organism and will thus never be
observed. Such segments are referred to as genomic “atoms” [9]
and prevent all blocks from becoming single alignment columns.


2.4 Comparison to Other Homology Prediction Tasks


WGA is closely related to classical sequence alignment (the align- 
ment of two or more relatively short and colinear sequences), and 
most whole-genome aligners rely on classical alignment techniques 
(e.g., the Needleman-Wunsch [10] and Smith-Waterman [11]
pairwise alignment algorithms and heuristics used for multiple 
alignments) as subroutines. However, there are three key differ- 
ences between these two classes of alignment. First, and most 
importantly, classical alignment requires sequences to be colinear, 
which is often not the case for genome sequences due to
rearrangement events. Second, even when restricted to toporthologous
relationships, the correspondences between genomes are not one to
one, which is also a requirement of classical alignment. Due, in part,
to the complications of these first two issues, it is difficult to 
formulate a useful objective function (such as the sum-of-pairs 
score for classical alignment) for WGA. Thus, most genome align- 
ment methods are heuristic procedures that lack an explicit objec- 
tive. A last difference between classical alignment and WGA is the 
scale of the problem. Classical alignment typically focuses on the 
alignment of single genes, which are usually on the order of 
thousands of nucleotides long. Whole genomes, in contrast, are 
millions to billions of nucleotides in length. The facts that genomes 
are large and are often neither colinear nor in one-to-one corre- 
spondence with other genomes are what make WGA challenging. 
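
For readers unfamiliar with the sum-of-pairs objective mentioned above, the following Python sketch scores a colinear multiple alignment by summing a pairwise score over every pair of rows in every column. It is an illustration only: the scoring values (match +1, mismatch -1, gap -2, gap against gap 0) and the function names are assumptions chosen for this example, not parameters of any aligner discussed in this chapter.

```python
from itertools import combinations
from typing import Sequence


def pair_score(x: str, y: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score one pair of aligned characters (assumed toy scoring scheme)."""
    if x == "-" and y == "-":
        return 0          # two gaps carry no information
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(rows: Sequence[str]) -> int:
    """Sum the pairwise scores over all row pairs in all alignment columns."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            total += pair_score(x, y)
    return total


# The slice of alignment block Z shown in Fig. 2.
block_z_slice = ["TT-CTAAGTG", "CTACTAAG-G", "CTACT--GTG", "CTACC--GTG"]
print(sum_of_pairs(block_z_slice))  # 4 under the assumed scores
```

The score is well defined only because the rows form a single colinear matrix. Once segments are rearranged or duplicated, there is no single matrix to score, which is one way to see why such an objective does not carry over directly to WGA.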

Since WGA is often focused on orthologous relationships, it is 
also related to the “orthology prediction” problem (see Chapter 9 
[5]). The key difference between the two problems is that orthol- 
ogy prediction is traditionally cast at the level of genes, whereas 
WGA operates at the level of nucleotides. For most orthology 
prediction methods, a genome is treated as an unordered set of 
genes. Whole-genome aligners, on the other hand, consider a 
genome to be a set of DNA sequences (chromosomes) within 
which genes are embedded. Thus, a WGA provides orthology 
predictions for both genes and intergenic regions. Due in part to 
their treatment of genomes as long nucleotide sequences, current 
WGA methods rely exclusively on sequence similarity and the 
ordering of nucleotides in a genome to predict orthology. In con- 
trast, orthology prediction methods often use phylogenetic ana- 
lyses, which can be more powerful than genome order and 
sequence similarity information alone. Thus, while the problem of 
WGA is broader in scope than that of orthology prediction, it is 
restricted to the analysis of relatively closely related genomes, for 
which homology of nongenic nucleotides is detectable and gene 
order is at least partially conserved. Gene-level orthology predic- 
tion is more appropriate for distantly related genomes, which may 
only have detectable homology at the amino acid level and little 
colinearity. 


WGAs are powerful because they allow for the analysis of molecular 
evolution at both large and small scales. At the large scale, one can 
use such alignments to estimate the frequency and location of 
rearrangement and duplication events. For example, one might 
use a WGA between human and mouse to identify colinear ortho- 
logous blocks, which are then given to a rearrangement analysis 
method (e.g., [12]) to determine a most parsimonious set of rear- 
rangement events explaining the current structures of the two 
genomes. At the small scale, WGAs can be used to examine the 
rates of substitutions and indels across the entire genome. For 
example, one might look at alignments of ancestral repeats to 
estimate the neutral rates of nucleotide evolution. Both small-
and large-scale mutational events identified from WGAs can be
used as data for species tree inference. In combination with carefully 
constructed models of genome evolution at both scales, WGAs also 
enable the task of ancestral genome reconstruction [13, 14]. 

Beyond purely evolutionary studies, WGAs are valuable for 
identifying functional elements within genomes. Each class of func- 
tional element within the genome tends to have a unique “evolu- 
tionary signature,” which can be searched for within WGAs 
[15]. For example, coding sequences tend to have mutational 
patterns with a predominance of substitutions at the third positions 
of codons, which are unlikely to affect the amino acid sequence. 
This characteristic evolutionary signature of coding sequence has 
led to the development of comparative gene-finding methods, 
which often use WGAs (Chapter 6 [16]). Noncoding RNA 
sequences can also be identified from WGAs but have more com- 
plex signatures involving compensatory mutations that maintain 
base pairing within RNA secondary structures [17]. More gener- 
ally, one can search for evolutionarily constrained regions within 
WGAs, which can contain functional elements from a variety of 
classes [18]. When combined with the knowledge of transcription 
factor-binding motifs, this approach can be used to identify tran- 
scription factor-binding sites with a technique called “phylogenetic 
footprinting” [19]. The easiest evolutionarily constrained regions 
to pick out are those of “ultraconserved elements,” which maintain 
high levels of sequence identity across large evolutionary distances 
and are primarily noncoding components of the genome [20]. 
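
As a crude illustration of scanning a WGA block for highly conserved stretches, the Python sketch below reports runs of alignment columns that are identical, and gap-free, in every row. This is only a proxy for evolutionary constraint: the methods cited above rely on explicit models of substitution rather than exact identity. The function name `conserved_runs` and the `min_len` parameter are assumptions introduced for this example.

```python
from typing import List, Sequence, Tuple


def conserved_runs(rows: Sequence[str], min_len: int = 3) -> List[Tuple[int, int]]:
    """Return half-open column intervals (start, end) of length >= min_len
    in which every row carries the same non-gap character."""
    runs = []
    start = None
    ncols = len(rows[0])
    for i in range(ncols + 1):          # one extra step closes a run at the end
        identical = (i < ncols
                     and rows[0][i] != "-"
                     and all(row[i] == rows[0][i] for row in rows))
        if identical and start is None:
            start = i
        elif not identical and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


block = ["ACGTTTGA",
         "ACGATTGA",
         "ACGTTTGC"]
print(conserved_runs(block))  # [(0, 3), (4, 7)]
```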

WGAs also allow for the transfer of functional information 
about specific elements from one species to another. As WGAs 
typically predict orthology and orthologous sequences are likely 
to have similar functions, WGAs are valuable for function predic- 
tion. By aligning at the nucleotide level across the genome, they can 
aid in function prediction for both genes and nongenic regions, 
such as those that contain regulatory elements. For example, if we 
are interested in a specific disease-associated interval in the human 
genome, we might use an alignment to identify where its mouse 
orthologs are located. Knowledge of the mouse orthologs would 
enable us to have a better understanding of the evolutionary history 
of this genomic region and could lead to genetic manipulation 
experiments that can only be performed in mice. 


It is easier to understand the existing methods for performing 
WGA by first appreciating the shortcomings of a simplistic 
approach for comparing whole-genome sequences. One simple 
approach would be to run BLAST [21], or another similar local 
alignment tool, between all pairs of genomes. The WGA would
then be defined as the union of all significant pairwise local
alignments discovered by BLAST. By using a local alignment tool, we
avoid the issues of rearrangements and duplications, as sets of local
alignments are not constrained to be colinear or in one-to-one
correspondence.

128 Colin N. Dewey

While this approach would certainly yield a large set of homology
predictions between all pairs of genomes, it has a number of
shortcomings. First, by only using a BLAST significance threshold,
it makes no distinction between orthology, paralogy, and other
refinements of homology. Second, the pairwise alignments that it
produces are not guaranteed to be consistent with each other, even
though homology, by definition, is a transitive relation. Third,
BLAST may miss some homologous sequences that have low
similarity but are strongly supported in their relatedness by flanking
homologous sequences. BLAST’s significance statistics are proven
for ungapped sequences and good in practice for sequences with
short indels [22], but are not designed for whole-genome
comparisons, which often feature large-scale insertions and deletions and
heterogeneous substitution rates. Lastly, this approach is overly
computationally intensive. For example, it does not take advantage
of the fact that homology is a transitive relation, that relationships
between sequences are reasonably modeled by a tree, and that
homologous sequences between genomes are often found in long
colinear segments.


3.2 The Two Major Approaches to WGA


Existing WGA methods attempt to address one or more of the 
weaknesses of this simple approach. These methods can be loosely 
classified into two major strategies which we refer to as the “hierar- 
chical” and “local” approaches. The main idea behind the hierar- 
chical approach is to split the WGA problem into a set of global 
multiple alignment problems. To do this, it first identifies the 
colinear and homologous (typically orthologous) segments of the 
genomes. Each set of colinear segments is then given to a 
specialized genomic global alignment method to produce a 
nucleotide-level alignment. In contrast, the first step of the 
“local” approach is to produce a large set of nucleotide-level align- 
ments. Later steps involve the filtering and merging of these align- 
ments to produce sets of pairwise or multiple alignments of 
homologous (typically orthologous) sequences. Despite their dif- 
ferences, both strategies typically begin with a local alignment step 
that is similar to the simplistic all-vs.-all alignment of the BLAST 
approach. A summary of all of the WGA methods described in this 
chapter and the role they play within one or both approaches is 
given in Table 1. 

Both approaches have advantages and disadvantages. The pri- 
mary advantage of the hierarchical approach is that it can often be 
faster and breaks a WGA into a number of independent
subproblems that can be solved in parallel. It is faster because the
identification of long colinear and orthologous segments in the
genomes can be accurately computed without the need for sensitive
nucleotide-level alignments. However, because hierarchical
methods do not often use the most sensitive aligners for this step, they
tend to miss small rearranged or diverged segments. Thus, the
primary advantage of the local method is in its sensitivity to these
regions, although “glocal” alignment methods [23], which allow
for small rearrangements, can partially ameliorate this weakness of
hierarchical methods. Hierarchical methods also run the risk of
being overconfident of the colinearity of genomic segments and
can thus produce more false-positive aligned positions within
sequences predicted to be colinear.

Table 1
A list of the WGA methods cited in this chapter

Method              Category                          Relationships predicted  Pairwise or multiple  References
BLAST               Local alignment                   Homology                 Pairwise              [21]
BLAT                Local alignment                   Homology                 Pairwise              [32]
STELLAR             Local alignment                   Homology                 Pairwise              [33]
LASTZ               Local alignment                   Homology                 Pairwise              [34]
LAST                Local alignment                   Homology                 Pairwise              [28]
MUMmer              Local alignment                   Orthology                Pairwise              [35]
CHAOS               Local alignment                   Homology                 Pairwise              [36]
GRIMM-Synteny       Homology mapping                  Toporthology             Multiple              [40]
DRIMM-Synteny       Homology mapping                  Homology                 Multiple              [45]
Mercator            Homology mapping                  Toporthology             Multiple              [46]
Enredo              Homology mapping                  Homology                 Multiple              [47]
OSfinder            Homology mapping                  Toporthology             Multiple              [48]
SuperMap            Homology mapping                  Homology                 Multiple              [49]
Sibelia             Homology mapping                  Homology                 Multiple              [50]
M-GCAT              Hierarchical WGA                  Toporthology             Multiple              [51]
progressiveMauve    Hierarchical WGA                  Toporthology             Multiple              [52]
MUGSY               Hierarchical WGA                  Toporthology             Multiple              [53]
Cactus              Hierarchical WGA                  Homology                 Multiple              [54]
MAVID               Global genomic alignment          Colinear homology        Multiple              [60]
LAGAN/Multi-LAGAN   Global genomic alignment          Colinear homology        Pairwise/multiple     [37]
DIALIGN             Global genomic alignment          Colinear homology        Multiple              [36]
SeqAn::T-Coffee     Global genomic alignment          Colinear homology        Multiple              [61]
Pecan               Global genomic alignment          Colinear homology        Multiple              [47]
FSA                 Global genomic alignment          Colinear homology        Multiple              [62]
NUCmer/PROmer       Local WGA                         Orthology                Pairwise              [35]
MULTIZ/TBA          Local WGA                         Homology                 Multiple              [8]
AXTCHAIN/CHAINNET   Alignment chaining and filtering  Orthology                Pairwise              [67]
PicoInversionMiner  Alignment refinement              Orthology                Pairwise              [68]
Cassis              Alignment refinement              Orthology                Pairwise              [69, 70]
GenAlignRefine      Alignment refinement              Colinear homology        Multiple              [71]
PSAR-Align          Alignment refinement              Colinear homology        Multiple              [73]
Phylo               Alignment refinement              Colinear homology        Multiple              [76, 77]
SLAM                Alignment refinement              Colinear homology        Pairwise              [78]
DOUBLESCAN          Alignment refinement              Colinear homology        Pairwise              [79]
CESAR               Alignment refinement              Colinear homology        Pairwise              [81]
MORPH               Alignment refinement              Colinear homology        Pairwise              [82]
EMMA                Alignment refinement              Colinear homology        Pairwise              [83]
MAFIA               Alignment refinement              Colinear homology        Multiple              [84]
SAPF                Alignment refinement              Colinear homology        Multiple              [85]
REAPR               Alignment refinement              Colinear homology        Multiple              [86]

For each method, the approach it uses or the role it plays within a larger WGA system is given in the “category” column.
Each method is labeled as either “pairwise” or “multiple” depending on whether it can be applied to generate multiple
alignments. In addition, the primary type of evolutionary relationship predicted by each method is given in the
“relationships predicted” column


3.3 Local Pairwise Genomic Alignment


Methods for both WGA strategies generally start by finding local 
alignments between, and perhaps within, the genomes. The 
Smith-Waterman algorithm is the classical solution to the pairwise 
local alignment problem, but is generally not used for WGA 
because it runs in time quadratic in the size of the genomes, 
which can be large. Instead, most methods adopt a
“seed-and-extend” approach for discovering high-scoring local
alignments, much like BLAST. This approach first identifies short ungapped
matches between the sequences using one of a variety of data 
structures. It then extends the short matches from both ends 
using a variant of the Smith-Waterman algorithm, stopping the 
extension when the score of the alignment drops below a specified 
threshold. In some cases, nearby and consistent (in terms of order 
and orientation) local alignments are “chained” together to form 
larger alignments. 
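
The seed-and-extend idea described above can be sketched in a few lines of Python. The sketch makes several simplifying assumptions that should not be attributed to any particular tool: seeds are exact k-mers found with a dictionary index, extension is ungapped rather than a Smith-Waterman variant, and extension stops with an X-drop rule (the running score falls more than `x_drop` below the best score seen). All function and parameter names are hypothetical.

```python
from collections import defaultdict


def extend(q, t, qi, ti, step, x_drop, match=1, mismatch=-1):
    """Ungapped extension from (qi, ti) in direction step (+1 or -1).
    Returns (best score gain, number of positions kept)."""
    score = best = best_len = n = 0
    while 0 <= qi < len(q) and 0 <= ti < len(t):
        score += match if q[qi] == t[ti] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score > x_drop:
            break                      # X-drop: give up on this direction
        qi += step
        ti += step
    return best, best_len


def seed_and_extend(q, t, k=4, x_drop=2):
    """Return sorted (q_start, t_start, length, score) ungapped local alignments."""
    index = defaultdict(list)          # k-mer -> start positions in t
    for j in range(len(t) - k + 1):
        index[t[j:j + k]].append(j)
    hits = set()
    for i in range(len(q) - k + 1):
        for j in index.get(q[i:i + k], ()):
            right, rlen = extend(q, t, i + k, j + k, +1, x_drop)
            left, llen = extend(q, t, i - 1, j - 1, -1, x_drop)
            # The seed itself contributes k matches (score k with match=1).
            hits.add((i - llen, j - llen, llen + k + rlen, left + k + right))
    return sorted(hits)


print(seed_and_extend("GGACGTACGTTT", "CCACGTACGTAA"))
# [(2, 2, 8, 8), (2, 6, 5, 5), (6, 2, 4, 4)]
```

Several seeds on the same diagonal extend to the same alignment, which is why the hits are collected in a set. Real aligners avoid this redundant work, for example by skipping seeds that fall inside an alignment already found.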

There are a number of techniques used for discovering seeds at 
the genomic scale for the “seed-and-extend” approach to local 
alignment. A first distinction between the techniques is whether 
they find exact or inexact matching seeds. Exact seed discovery is 
often faster and easier to implement, whereas inexact seeds offer 
better sensitivity. Seed techniques also vary in whether they use 
“consecutive” or “spaced” seeds [24]. Consecutive seeds consider 
matches and mismatches at all positions within a sequence interval, 
whereas spaced seeds only check for matches at a subset of positions 
within an interval. The specific subset of positions checked is known 
as the “seed pattern,” and there has been significant work on 
determining optimal sets of multiple seed patterns (e.g., 
[25, 26]). It has been shown that carefully chosen spaced seed 
patterns are superior to consecutive seeds in terms of sensitivity 
[27]. Lastly, seeds differ in whether their lengths are fixed or 
adaptive (variable). For WGA, adaptive seeds have been shown to 
allow for faster local alignment at the same level of sensitivity as 
fixed seeds [28].
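
To make the distinction between consecutive and spaced seeds concrete, the following Python sketch finds seed hits for an arbitrary seed pattern written as a string of `1` (position must match) and `0` (position ignored). The pattern `1101` used in the example is an arbitrary toy pattern chosen for illustration, not one of the optimized patterns from the literature cited above, and the function names are assumptions.

```python
from collections import defaultdict


def seed_key(seq, start, pattern):
    """Concatenate the bases of seq at the '1' positions of the pattern."""
    return "".join(seq[start + i] for i, c in enumerate(pattern) if c == "1")


def spaced_seed_hits(q, t, pattern):
    """Return (q_pos, t_pos) pairs whose windows agree at every '1' position."""
    span = len(pattern)
    index = defaultdict(list)
    for j in range(len(t) - span + 1):
        index[seed_key(t, j, pattern)].append(j)
    hits = []
    for i in range(len(q) - span + 1):
        for j in index.get(seed_key(q, i, pattern), ()):
            hits.append((i, j))
    return hits


q, t = "ACGT", "ACTT"                      # one substitution at the third base
print(spaced_seed_hits(q, t, "1111"))      # []        consecutive seed misses
print(spaced_seed_hits(q, t, "1101"))      # [(0, 0)]  spaced seed tolerates it
```

Both patterns span four bases, but the spaced pattern ignores the third position and therefore still produces a hit across the substitution.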

Seed-finding techniques can often be improved by taking 
advantage of DNA evolutionary models. A generalization of spaced 
seeds is “subset seeds” [29], which allow subsets of bases to be 
considered equivalent when determining if there is a match at a 
given position. Subset seeds are particularly useful for taking into 
account that transitions are often more common than transversions 
in genome comparisons. Further taking into account biologically 
informed substitution patterns is the “translated” seed, which is a 
match at the amino acid level after translating genomic sequences in 
all six possible reading frames. Translated seeds enable increased 
sensitivity in comparisons of more diverged genomes. Lastly, when 
aligning a genome to a set of genomes for which a multiple WGA 
has already been constructed, one can take into account the substi- 
tution patterns and ancestral sequences inferred from the WGA to 
devise more sensitive seeds [30, 31]. 
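
A subset seed that tolerates transitions can be sketched as follows. The pattern alphabet used here is an assumption made for illustration: `1` requires an exact match, `T` accepts a match or a transition (A/G or C/T), and `0` ignores the position. Bases at `T` positions are collapsed to purine (`R`) or pyrimidine (`Y`) before comparison.

```python
PURINE_PYRIMIDINE = {"A": "R", "G": "R", "C": "Y", "T": "Y"}


def subset_seed_key(window, pattern):
    """Reduce a window to its key under a subset seed pattern."""
    key = []
    for base, p in zip(window, pattern):
        if p == "1":
            key.append(base)                      # exact match required
        elif p == "T":
            key.append(PURINE_PYRIMIDINE[base])   # transitions are equivalent
        # positions marked '0' contribute nothing to the key
    return "".join(key)


def subset_seed_match(x, y, pattern):
    """True if windows x and y form a seed under the pattern."""
    return subset_seed_key(x, pattern) == subset_seed_key(y, pattern)
```

Under the pattern `1T11`, the windows `ACGT` and `ATGT` match because C and T are both pyrimidines, while `ACGT` and `AGGT` do not because C to G is a transversion.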

The choice of seed type is the major determinant of the data 
structures used for seed discovery. For example, BLAT [32] uses a 
simple index of all possible k-mers for exact and translated seeds but 
uses a heuristic of indexing only nonoverlapping k-mers for mem- 
ory efficiency. STELLAR [33] also uses an index of k-mers but 
implements an exact algorithm based on filtration for finding all 
local alignments with an error rate below a given threshold. LASTZ 
(the successor to BLASTZ [34]), which uses a carefully chosen
spaced seed pattern introduced by [24], instead uses a hash table 
to find both exact and inexact matches. Not to be confused with 
LASTZ is the more recently developed LAST aligner [28], which 
uses adaptive seeds with highly configurable patterns that are iden- 
tified via a suffix array data structure. MUMmer uses a suffix tree to 
rapidly find all exact consecutive seeds with some minimum length 
[35]. CHAOS [36], which is a component of the LAGAN-suite of 
genome alignment tools [37], uses a related structure, a “threaded 
trie,” to find exact and inexact consecutive seeds. 
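
The memory-saving heuristic of indexing only nonoverlapping k-mers can be sketched with a dictionary-based index. This is an illustration of the idea only, not the BLAT implementation; the function names and the `step` parameter are assumptions.

```python
from collections import defaultdict


def build_index(target, k, step):
    """Index k-mers of target starting every `step` bases.

    step == k indexes only nonoverlapping k-mers; step == 1 indexes all.
    """
    index = defaultdict(list)
    for pos in range(0, len(target) - k + 1, step):
        index[target[pos:pos + k]].append(pos)
    return index


def exact_seeds(index, query, k):
    """Look up every k-mer of the query; return (target_pos, query_pos) hits."""
    return [(t, q)
            for q in range(len(query) - k + 1)
            for t in index.get(query[q:q + k], [])]
```

Because every k-mer of the query is still looked up, a sufficiently long exact match is guaranteed to contain at least one indexed target k-mer, while the index holds roughly 1/k as many entries as a full index.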

For computational efficiency reasons, the extension step of the 
seed-and-extend approach typically only allows for ungapped align- 
ments or alignments with short indels. However, genome align- 
ments often feature large indels that are not discovered by 
extension from a seed. Thus, many local genomic alignment tools 
use a “chaining” step to link nearby and consistent local alignments 
discovered by the seed-and-extend strategy. For example, MUM- 
mer includes a module for chaining together nearby exact matches 
using a variation of the longest increasing subsequence (LIS) prob- 
lem [38]. CHAOS also uses an LIS-derived algorithm for chaining 
the inexact consecutive seeds it discovers. Chaining is often fol- 
lowed by more sensitive alignment between chained local align- 
ments. For example, MUMmer runs a variant of Smith-Waterman 
alignment in between chained matches and LASTZ recursively 
searches for alignments with more sensitive seeds in between nearby 
alignments discovered in previous steps. 
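
The connection between chaining and the longest increasing subsequence problem can be made concrete with a small sketch. It assumes that each anchor is reduced to a single point (x, y) giving its start in each genome, that all anchors are on the same strand, and that anchors are unweighted; actual chaining modules use weighted variants that also account for anchor length and gaps.

```python
from bisect import bisect_left


def chain_anchors(anchors):
    """Return a longest chain of (x, y) anchors increasing in both coordinates."""
    # Sort by x, breaking ties by descending y so that two anchors with
    # the same x can never both appear in a strictly increasing chain.
    pts = sorted(anchors, key=lambda p: (p[0], -p[1]))
    tails = []     # tails[k]: smallest final y among chains of length k + 1
    tail_idx = []  # index in pts of the anchor that ends that chain
    prev = [None] * len(pts)
    for i, (_, y) in enumerate(pts):
        k = bisect_left(tails, y)
        if k == len(tails):
            tails.append(y)
            tail_idx.append(i)
        else:
            tails[k] = y
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else None
    # Walk the back-pointers from the end of the longest chain.
    chain = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:
        chain.append(pts[i])
        i = prev[i]
    return chain[::-1]
```

The binary search keeps the running time at O(n log n) in the number of anchors, which matters when millions of seeds are chained.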


The hierarchical approach to WGA consists of two steps. First, a 
high-level homology map between the genomes is constructed. 
Second, a nucleotide-level alignment is obtained by running a 
genomic global alignment tool on each homologous and colinear 
set of genomic segments identified by the homology map. Hierar- 
chical WGA methods vary in the exact techniques used for 
each step. 

The idea behind the hierarchical approach is to separate the 
problem of identifying rearrangements and duplications from that 
of obtaining a nucleotide-level alignment. In the absence of rear- 
rangements and duplications, WGA simply reduces to classical 
sequence alignment although at a much larger scale. Thus, if a 
WGA problem can be broken into a set of subproblems that do 
not contain these large-scale events, the numerous methods that 
have been developed for classical global alignment can be utilized. 

The first step of the hierarchical strategy is to construct a 
homology map between the genomes of interest. A homology 
map is a collection of sets of genomic intervals, where each set of 
intervals is required to be homologous and colinear (i.e., free of
rearrangements and duplications). Each set represents the 
sequences that will ultimately form a block within a WGA. 
Homology maps generally have the property that each genomic
position belongs to at most one set and has all of its homologs 
contained within that set. For WGA, homology maps are often 
restricted in the evolutionary relationships that are captured, as 
only a subset of homologous relationships may be of interest. 
Typically, only orthologous relationships are captured, forming an 
“orthology map.” When orthology maps are restricted to predict- 
ing one-to-one relationships, they are more likely to be representa- 
tive of toporthology. 

The concept of a homology map is closely related to the con- 
cepts of “conserved segments” and “syntenic blocks,” which gen- 
erally refer to sets of genomic intervals containing multiple 
homologous markers (e.g., genes) and featuring conserved orien- 
tations and adjacencies of these markers [39, 40]. Unfortunately, 
these concepts have long been poorly defined, and, as a result, 
methods for syntenic block identification differ markedly in their 
output [41]. In addition, methods for identifying syntenic blocks 
(or closely related concepts) are often focused on identifying sets of 
genomic intervals that exhibit levels of conservation of marker 
content or colinearity that exceed what one would expect if markers 
were randomly shuffled between genomes (e.g., [42-44]). This is
in contrast to homology maps, which are concerned with colinear 
homology, regardless of biological significance. And, in practice, 
homology maps are intermediate objects in the process of WGA, 
whereas syntenic block predictions are often of direct interest. 

Homology maps are most commonly constructed from local 
alignments, such as those computed by methods discussed in the 
previous section. As only a high-level correspondence is desired, 
these methods are often run in faster but less sensitive configura- 
tions. For example, local alignments between just the coding inter- 
vals of the genomes can be computed quickly and used for the 
construction of homology maps that are at least accurate with 
respect to protein-coding genes. 

Although numerous pairwise homology mapping methods 
exist, in this chapter, we restrict our attention to methods that 
scale to more than two genomes, as the problem is significantly 
more challenging in the multiple genome case. Examples of multi- 
ple genome homology map methods include GRIMM-Synteny 
[40], its successor DRIMM-Synteny [45], Mercator [46], Enredo 
[47], OSfinder [48], SuperMap [49], and Sibelia [50]. The WGA 
programs M-GCAT [51], progressiveMauve [52], MUGSY [53], 
and Cactus [54] are integrated hierarchical methods that contain a 
homology mapping stage. 

Many of these methods use graph-based data structures to find 
a mapping between multiple genomes simultaneously. Kehr et al. 
[55] characterized the relationships between four commonly used 
types of graphs: alignment graphs [56], A-Bruijn graphs [57, 58],
Enredo graphs [47], and Cactus graphs [59]. The most 
straightforward graph is the alignment graph, which is a mixed
graph with vertices representing genomic segments, directed 
edges representing adjacent segments, and undirected edges repre- 
senting homologous segments. In an A-Bruijn graph, vertices 
instead represent sets of homologous segments, and directed 
edges represent adjacencies between pairs of segments (one from 
each set represented by the connected vertices). Relative to align- 
ment graphs, A-Bruijn graphs are more compact and readily reveal 
the content of each genome. An Enredo graph is very similar to an 
A-Bruijn graph, but has a pair of vertices instead of a single vertex 
for each set of homologous segments, which captures information 
regarding the directionality of each segment within a homologous 
set. Lastly, cactus graphs flip the representation of adjacencies, with 
vertices corresponding to sets of adjacencies and edges 
corresponding to sets of homologous segments. Cactus graphs 
have a natural decomposition that provides advantages for analysis 
and visualization of WGAs. 
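
The relationship between an alignment graph and an A-Bruijn graph can be sketched by collapsing homologous segments into single vertices. The sketch below is a simplification under stated assumptions: segments are opaque identifiers, orientation is ignored, and homology is treated as transitive so that a union-find structure can group segments. The function name and input layout are hypothetical.

```python
def a_bruijn_from_alignment_graph(genomes, homologous_pairs):
    """Collapse an alignment graph into an A-Bruijn-style graph.

    genomes: dict mapping genome name -> ordered list of segment ids
             (consecutive segments are the directed adjacency edges).
    homologous_pairs: iterable of (segment_id, segment_id) undirected edges.
    Returns (vertex_of, edges), where vertex_of maps each segment to the
    representative of its homology set and edges is a set of directed
    (vertex, vertex) adjacencies.
    """
    parent = {s: s for segs in genomes.values() for s in segs}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for a, b in homologous_pairs:
        parent[find(a)] = find(b)

    vertex_of = {s: find(s) for s in parent}
    edges = set()
    for segs in genomes.values():
        for left, right in zip(segs, segs[1:]):
            edges.add((vertex_of[left], vertex_of[right]))
    return vertex_of, edges
```

For two genomes of three segments each, with the first and last segments homologous, the six vertices of the alignment graph collapse to four vertices, which illustrates why the A-Bruijn representation is more compact.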

Graph-based homology mapping methods generally produce 
an initial WGA graph using one of the four representations we have 
discussed and then refine the graph via modifications. Of the 
homology mapping methods we have listed, GRIMM-Synteny, 
Mercator, and MUGSY use alignment graphs. DRIMM-Synteny 
and OSfinder use A-Bruijn graphs and Sibelia uses de Bruijn 
graphs, of which A-Bruijn graphs are a generalization. And, as 
their names suggest, Enredo and Cactus use Enredo and cactus 
graphs, respectively. These methods use a variety of techniques for 
graph refinement. For example, MUGSY is unique in its use of flow 
network algorithms to identify breaks in colinearity. OSfinder uses a 
novel probabilistic model to determine a maximum likelihood 
multiple genome orthology map. And Cactus uses a simulated 
annealing-style algorithm, the Cactus alignment filter, to refine an 
initial cactus graph representing a homology map. 

Unlike the graph-based methods that build a map between all 
genomes simultaneously, the SuperMap and progressiveMauve 
methods build a multiple genome map by progressively building 
pairwise maps up a guide tree. The pairwise SuperMap algorithm is 
essentially a symmetric version of the chaining method used by 
Shuffle-LAGAN [23], which allows for rearrangements and duplications
in its chains of orthologous segments. The progressiveMauve
mapping method instead uses a “breakpoint elimination”
algorithm to find colinear segments and does not allow for duplica- 
tions, thus producing output indicative of one-to-one toporthol- 
ogy. This algorithm greedily removes local alignments one by one 
with the goal of maximizing an objective function that takes into 
account both the number of breakpoints implied by an alignment 
and substitution scores. 

Once a homology map has been created, any one of a number 
of genomic global alignment methods can be used to align the 
orthologous and colinear segments identified by the map. As for 
our discussion of homology mapping methods, we restrict our
attention to global aligners that can handle multiple genomes. 
Examples of such methods are MAVID [60], MLAGAN [37], 
DIALIGN [36], SeqAn::T-Coffee [61], PECAN [47], FSA [62], 
and the base-level alignment refinement (BAR) algorithm of Cactus 
[54]. For colinear sequences, the genomic alignment problem is 
the same as that of classical global alignment but is made more 
difficult by the fact that the sequences are long (possibly millions of 
nucleotides in length). Thus, global genomic aligners employ heur- 
istics to speed up the process. By far, the most common heuristic 
used is to first identify short local alignments, or anchors, between 
the sequences, identify a chain of these anchors, and then perform 
global alignment between the adjacent chained anchors. This tech- 
nique is similar to the strategy for hierarchical WGA, but is simpler, 
due to the fact that rearrangements and duplications do not need to 
be taken into account. MLAGAN and DIALIGN use the CHAOS 
local aligner, PECAN and FSA use Exonerate [63], and MAVID 
and SeqAn::T-Coffee use suffix trees or arrays to find anchors. 

In addition to the specific local alignment technique used to 
speed up the alignment process, global genomic aligners also vary 
with respect to how they combine local pairwise alignments to 
build a multiple global alignment. First, MAVID, MLAGAN, 
SeqAn::T-Coffee, and Pecan all belong to the class of progressive 
alignment methods, which use a phylogenetic tree to guide their 
algorithms (see Chapter 7 [1]). For the alignment of non-leaf 
sequences during progressive alignment, MAVID uses maximum 
likelihood ancestral sequence inference, while MLAGAN, SeqAn:: 
T-Coffee, and Pecan use a sum-of-pairs objective function. Both 
SeqAn::T-Coffee and Pecan use a “consistency” technique, which 
adjusts the score between pairs of positions (or segments) based on 
the consistency of triplets of pairwise alignments. The nonprogres- 
sive methods, DIALIGN, FSA, and BAR, instead put together a 
multiple alignment by greedily merging consistent local pairwise 
alignments. While differing in their use of a tree, the FSA, Pecan, 
and BAR methods take advantage of probabilistic models of 
sequence alignment and attempt to maximize statistically grounded 
objective functions, as opposed to the heuristic score-based func- 
tions used by the other methods. BAR is unique in its ability to 
predict breakpoints when aligning groups of sequences that may 
contain the boundaries of rearrangement events. 
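
The idea of a consistency technique can be sketched with a simple rescoring rule in the spirit of library extension: the weight of a residue pair is increased by the support it receives from every third residue aligned to both members of the pair. This is an illustrative assumption, not the exact procedure of SeqAn::T-Coffee or Pecan, and it only rescores pairs that are already present in the input.

```python
from collections import defaultdict


def consistency_extend(weights):
    """Rescore residue pairs using triplet consistency.

    weights: dict mapping (r, s) -> weight, where r and s are residues
             such as ("x", 0) from different sequences, each unordered
             pair listed once.
    Returns a dict with the same keys and consistency-adjusted weights.
    """
    nbrs = defaultdict(dict)
    for (r, s), w in weights.items():
        nbrs[r][s] = w
        nbrs[s][r] = w
    extended = {}
    for (r, s), w in weights.items():
        # A third residue t supports the pair by the weaker of its two links.
        support = sum(min(nbrs[r][t], nbrs[s][t])
                      for t in nbrs[r] if t in nbrs[s])
        extended[(r, s)] = w + support
    return extended
```

In this scheme, two competing pairs with equal initial weight are separated once a third sequence agrees with only one of them.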

Although the hierarchical approach breaks the WGA problem 
into a large number of subproblems (one per colinear segment set) 
that can be computed in parallel, it is still a significant computa- 
tional effort to produce a WGA with this approach, particularly for 
large eukaryotic genomes. Thus, a number of Web sites host pre- 
computed hierarchical WGAs. Alignments produced by the combi- 
nation of Pecan with either Enredo or Mercator are hosted at the 
Ensembl Web site [64]. Similarly, the VISTA Web site [65] hosts 
WGAs generated by SuperMap and the LAGAN-suite of genomic
aligners. Both sites offer visualizations of the WGAs, which are 
useful for looking at levels of conservation across genomes. 


The local approach to WGA bypasses the high-level homology map 
construction phase of the hierarchical approach and instead begins 
by identifying a comprehensive set of nucleotide-level pairwise local 
alignments. The second step of this approach is to combine the 
pairwise local alignments into a cohesive WGA by filtering out 
nonorthologous relationships and merging pairwise alignments 
into multiple alignments. Because there is typically no additional 
pairwise nucleotide-level alignment performed in the second step, 
the local alignments generated by the first step are obtained with a 
more sensitive aligner than that used by hierarchical methods for 
homology map building. The two primary examples of local WGA 
methods are MUMmer, a pairwise genome aligner, and MULTIZ/ 
TBA, a multiple genome aligner [8]. 

MUMmer was one of the first pairwise WGA methods to be 
developed and was initially targeted at the alignment of 
prokaryotic-sized genomes. The WGA ability of MUMmer is 
achieved through a combination of smaller modules that is orche- 
strated by the NUCmer or PROmer scripts. The first module 
identifies maximal unique matches (MUMs) between a pair of
genomes with a suffix tree data structure. Nearby matches are 
clustered together, and a high-scoring colinear chain of matches is 
identified within each cluster. Finally, the matches within the chains 
are extended with a variant of the Smith-Waterman algorithm, and 
the resulting extended chains are output as a WGA. The raw WGA 
output by MUMmer can, in general, include all classes of homolo- 
gous relationships. However, the chains are typically filtered to 
leave only those that are highest scoring or that result in a reference 
position being overlapped by only a single chain. Thus, a filtered 
WGA from MUMmer is usually representative of orthology.
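
The effect of filtering chains so that each reference position is overlapped by at most one chain can be sketched with a greedy procedure. This is a simplification for illustration and is not the filtering algorithm shipped with MUMmer; the tuple layout and function name are assumptions.

```python
def filter_chains(chains):
    """Keep the highest-scoring chains that do not overlap on the reference.

    chains: list of (ref_start, ref_end, score) with half-open intervals.
    Chains are considered in order of decreasing score, and a chain is
    kept only if it overlaps none of the chains already kept.
    """
    kept = []
    for chain in sorted(chains, key=lambda c: c[2], reverse=True):
        start, end, _ = chain
        if all(end <= ks or start >= ke for ks, ke, _ in kept):
            kept.append(chain)
    return sorted(kept)
```

A greedy choice like this favors long, high-scoring chains, which is consistent with the expectation that the retained chains are more likely to reflect orthology.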

MULTIZ/TBA, which was instead designed for large eukary- 
otic genomes, starts by using LASTZ to generate sensitive local 
pairwise alignments between all pairs of genomes or between a 
reference genome and all others. MULTIZ is then used to identify 
local alignment blocks of subsets of genomes that should be com- 
bined and to merge these blocks using a banded variant of the 
Smith-Waterman algorithm. TBA is the program that is used to 
coordinate this entire process when all pairs of genomes are com- 
pared. Thus far, it does not appear that TBA has been used at the 
whole-genome scale, although MULTIZ is regularly used for 
reference-based WGAs hosted by the UCSC Genome Browser 
[66]. For these reference-based WGAs, the ungapped segments of 
LASTZ alignments are first processed with a chaining program 
(AXTCHAIN) to establish large colinear alignments between the 
reference and another genome. In contrast to the output of 
chaining methods discussed in Subheading 3.3, a chain produced
by AXTCHAIN is an ordered set of pairwise local alignments rather 
than a single long alignment that explicitly aligns between the short 
local alignments that form the chain. AXTCHAIN chains are typi- 
cally filtered by the CHAINNET program to retain only the 
highest-scoring alignment at each position within the reference 
genome [67]. The remaining alignments, which most likely reflect 
orthologous relationships, are then combined into multiple align- 
ments with MULTIZ. 


Because of the computational complexity of multiple alignment, 
particularly at the whole-genome scale, methods of both 
approaches to WGA use heuristics and simplified models to make 
WGA feasible. For example, most of the methods described in this 
chapter do not distinguish between different classes of genomic 
sequence (e.g., genic and intergenic) while constructing 
nucleotide-level alignments. And many methods disregard small, 
marginally significant, local alignments for the sake of speed. As a 
result, at a local level, the results of current WGA methods often 
leave room for improvement. 

To remedy this situation, a number of methods have been 
developed that may be used to refine WGAs. These methods take 
as input either a WGA, a single WGA block, or the set of homolo- 
gous and colinear sequences that make up a WGA block. They can 
be generally grouped into one of three categories. The first is 
composed of methods that refine the local structure of a WGA. 
That is, they redefine the boundaries, or “breakpoints,” of the 
homologous and colinear blocks in the WGA. A second category of
methods focuses on optimizing individual WGA blocks
with respect to an objective function. The last category includes 
methods that perform alignment while taking into account the 
structure and evolutionary dynamics of certain classes of genomic 
elements. 

PicoInversionMiner [68] and Cassis [69, 70] are two methods 
for refining the local structure of a WGA. PicoInversionMiner 
identifies very small “inplace” inversions between two genomes 
that are left undetected by an initial WGA. Such inversions are 
represented by alignments that would typically not have statistically 
significant scores at the genome level but can be detected via 
probabilistic models of local sequence evolution. In contrast to 
PicoInversionMiner, which identifies novel rearrangement events, 
Cassis refines the coordinates of breakpoints. The refinements pro- 
duced by Cassis are the result of identifying weak similarities 
between sequences adjacent to segments of an initial orthology 
map and extending the boundaries of segments based on these 
similarities. The BAR algorithm of Cactus, which we have previ- 
ously discussed in the context of hierarchical WGA, is also an 
alignment refinement method that identifies breakpoints. 

Other methods for refining WGAs focus on improving local
colinear multiple alignments with respect to a given objective func- 
tion. For example, GenAlignRefine [71] attempts to optimize 
WGA blocks according to the COFFEE objective function [72] 
using a genetic algorithm. The PSAR-Align method [73] instead 
realigns blocks to optimize an expected accuracy objective function 
[74] using pairwise alignment probabilities estimated by the PSAR 
tool [75] and the sequence annealing algorithm of the FSA
multiple alignment method [62]. Lastly, the Phylo project 
[76, 77] refines WGAs by “crowd sourcing” the task of optimizing 
colinear alignment blocks, according to one of a number of objec- 
tive functions. Phylo casts the multiple alignment problem as a 
casual game that may be played by “citizen scientists” at the pro- 
ject’s website (http://phylo.cs.mcgill.ca/). 

Lastly, a number of methods have been developed that can 
improve the alignments of specific classes of genomic elements, 
such as gene structures. The primary goal of these methods is 
generally to improve prediction of genomic elements, but a more 
accurate alignment often results as a side product. Among the 
oldest of such methods are comparative gene finders that perform 
protein-coding gene prediction and pairwise alignment simulta- 
neously. These include SLAM [78] and DOUBLESCAN [79], 
both of which use pair hidden Markov models [80]. A related 
method, CESAR [81], was specifically designed for realignment 
and targets individual coding exons rather than full gene structures. 
Other methods focus on improving the alignment of noncoding 
regulatory regions by modeling the evolution of sets of transcrip- 
tion factor-binding sites with known motifs (e.g., MORPH [82], 
EMMA [83], and MAFIA [84]). Like the comparative gene finders, 
these methods also use statistical alignment techniques but with 
models extended to take into account the conservation of
binding sites instead of gene structures. SAPF [85] is also a method 
aimed at alignment of noncoding regulatory regions but more 
generally models sequences that are mixtures of “slow” and “fast” 
evolving elements without knowledge of binding motifs. Lastly, 
REAPR [86] focuses on the realignment and detection of noncod- 
ing RNAs by using alignment models that take into account the 
conserved secondary structures of such RNAs. 


4 Evaluation of WGAs 


Just as for small-scale alignment (Chapter 7, [1]), assessing the 
accuracy of WGAs is hard because we rarely know the true evolu- 
tionary history of a set of genome sequences. In fact, the evaluation 
of WGAs is even harder than that of protein alignments. While 
protein aligners can be evaluated with “gold standard” benchmark- 
ing databases where the truth is established through protein 
structural information, genome aligners have no benchmarks of
real data. In addition, WGAs must be assessed not only for whether 
they align truly homologous sequences but also for whether they 
correctly predict orthologous (or toporthologous) relationships. 
Thus, the evaluation of WGAs is related to that of gene orthology 
prediction, which is discussed in Chapter 9 [5]. Despite these 
challenges, a number of creative approaches have been used for 
determining the accuracy of WGA methods. The approaches gen- 
erally fall into four categories: (1) simulation, (2) analysis of align- 
ments to annotated regions, (3) comparison with predictions from 
other methods, and (4) alignment statistics. 

Simulated data are appealing for evaluation as we know the 
entire evolutionary history of the simulated sequences and can 
thus thoroughly evaluate the accuracy of an alignment. Many of 
the WGA methods described in this chapter have used simulations 
for assessing their accuracies [8, 47, 52, 54, 62]. The Alignathon 
[87], one of the most comprehensive evaluations of WGA methods 
to date, relied heavily on simulated data sets. This study called 
attention to one potential pitfall of simulation-based evaluation, 
which is that the performance of a WGA method may be over- 
estimated when that method was developed or trained with respect 
to the same simulator used for the assessment. 
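
When the true alignment is known from a simulation, accuracy can be measured by comparing sets of aligned position pairs. The following sketch does this for a pairwise alignment given as two gapped rows. The function names are hypothetical, and it assumes both alignments contain at least one aligned pair.

```python
def aligned_pairs(row_a, row_b):
    """Return the set of (pos_in_a, pos_in_b) aligned by two gapped rows."""
    pairs, i, j = set(), 0, 0
    for ca, cb in zip(row_a, row_b):
        if ca != "-" and cb != "-":
            pairs.add((i, j))
        i += ca != "-"   # advance only past residues, not gaps
        j += cb != "-"
    return pairs


def sensitivity_and_ppv(true_rows, predicted_rows):
    """Fraction of true pairs recovered, and fraction of predicted pairs that are true."""
    truth = aligned_pairs(*true_rows)
    pred = aligned_pairs(*predicted_rows)
    tp = len(truth & pred)
    return tp / len(truth), tp / len(pred)
```

The same comparison extends to multiple alignments by applying it to every pair of genomes and pooling the counts.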

Simulating the evolution of whole genomes is a challenging 
task, and it is unclear if the current models used for simulation are 
close to reality. Such models are highly complex, as they have to 
account for many different types of evolutionary events, at both the 
small and large scales. For example, they need to model the random 
mutations of both single-nucleotide substitutions and megabase- 
sized inversions. In addition, they also need to model natural 
selection, which alters the probability of these random mutations 
becoming fixed within a population. For example, an inversion that 
cuts an essential gene in half might have a much lower probability of 
becoming fixed than an inversion with both end points in inter- 
genic regions. Despite these challenging model details, a number of 
genomic evolution simulators have been developed. Currently, only 
three simulators model both small-scale events (e.g., substitutions 
and indels) and large-scale rearrangements and duplications 
[88-90]. Other simulators focus only on nonrearranging events 
[8, 91-98] and are thus good for evaluating colinear genomic 
aligners but not homology mapping methods. 

A second class of approaches to evaluating WGAs leverages our 
knowledge of various classes of elements within the genome. For 
example, with our understanding that most coding regions are 
conserved across closely related genomes, the fraction of exons in 
a genome “covered” by an alignment is an indirect measure of the 
sensitivity of a WGA [37, 49, 60, 99]. Specificity can also be 
roughly assessed with coding regions, either by counting the num- 
ber of coding bases that are aligned to noncoding bases in other 
genomes [36, 100] or by checking that alignments in coding
regions exhibit periodicities in their substitution patterns [99]. A 
related approach that instead assesses the accuracy of eukaryotic 
orthology maps is to check if exons from the same gene are mapped 
in the same order and orientation to other genomes [47]. For the 
subset of protein-coding and noncoding RNA genes that have 
curated “gold standard” alignments, the accuracy of a WGA with 
respect to those genes may be assessed [101]. However, the fact 
that genic regions are often highly conserved is also a disadvantage 
of using them for evaluation; the most conserved regions are the 
easiest to align, and some aligners use exon annotation information 
or translated matches. Because of these issues, repeat sequences, 
which are believed to evolve more neutrally, have been used for 
alignment evaluation [47, 99]. For example, in [99], sensitivity was 
assessed by alignments of ancestral repetitive elements, and speci- 
ficity was inferred from the number of alignments to lineage- 
specific repeat elements (in this study, primate-specific Alu repeats). 
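
The annotation-based sensitivity measure, the fraction of exon bases covered by an alignment, reduces to interval arithmetic. The sketch below assumes half-open intervals on a single chromosome and uses hypothetical function names; the nested loop is quadratic and would be replaced by a sweep for genome-scale inputs. The same function applies to ancestral repeat annotations.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_fraction(features, aligned):
    """Fraction of annotated bases that fall inside aligned intervals."""
    features, aligned = merge_intervals(features), merge_intervals(aligned)
    total = sum(end - start for start, end in features)
    covered = 0
    for fs, fe in features:
        for as_, ae in aligned:
            covered += max(0, min(fe, ae) - max(fs, as_))
    return covered / total
```

Merging both interval sets first ensures that overlapping annotations or overlapping alignment blocks are not counted twice.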

Another common evaluation technique is to compare whole- 
genome aligners against other related methods. For example, a 
WGA produced by one method can be used as the “truth” with 
which to evaluate the sensitivity and specificity of other WGAs 
[53]. This technique is useful for judging the similarity of different 
WGAs but, unfortunately, does not provide much information 
about accuracy. Another technique is to compare with the results 
from gene orthology prediction programs [48, 49]. The advantage 
of this approach is that it provides a more independent test of 
accuracy, since gene orthology prediction programs generally use 
different algorithms and information sources to infer orthology. 
The disadvantages of this approach are that it only provides a gene- 
level measure of accuracy and does not evaluate alignments of 
noncoding regions. In addition, since WGA and gene orthology 
prediction share similar goals, we might expect that future methods 
will blend techniques from both and thus that this evaluation 
approach will decrease in usefulness. 

A last class of evaluation techniques involves the computation 
of statistics for WGAs. These statistics can be subdivided into 
simple descriptive statistics and measures computed via statistical 
or sampling techniques. One of the most straightforward descrip- 
tive statistics of a WGA is the “coverage” or the fraction of the 
genomes included in an alignment or orthology map block [45, 47, 
49, 53, 87]. Generally, the higher the coverage, the more sensitive 
the WGA is believed to be, although one can easily create high- 
coverage WGAs with poor sensitivity. As a check of large-scale 
specificity in mammalian WGAs, the authors of [47] checked the 
fraction of the X chromosome that was covered by alignments to 
autosomal chromosomes in other genomes (the assumption being 
that translocations into and out of the X chromosome are rare in 
mammals). Some more detailed nucleotide-level statistics of WGAs 
include the total number of “core” positions [53], which are
gap-free alignment columns containing all genomes, and the aver- 
age level of sequence identity in aligned columns [61]. 
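
Both nucleotide-level statistics can be computed directly from the rows of an alignment block. In the sketch below, the definition of average identity (the fraction of identical residue pairs within columns) is one reasonable choice made for illustration and may differ from the definitions used in the cited studies; the function names are hypothetical.

```python
from itertools import combinations


def core_columns(rows):
    """Count gap-free columns, i.e., columns in which every genome is present."""
    return sum(1 for col in zip(*rows) if "-" not in col)


def pairwise_identity(rows):
    """Fraction of identical residue pairs, taken over all alignment columns."""
    same = total = 0
    for col in zip(*rows):
        residues = [c for c in col if c != "-"]
        for a, b in combinations(residues, 2):
            total += 1
            same += a == b
    return same / total
```

Neither statistic requires knowledge of the true alignment, which is why both are easy to report but only indirectly related to accuracy.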

More sophisticated statistics related to WGA accuracy are com- 
puted through the use of statistical or sampling techniques. Just as 
they are used for BLAST, Karlin and Altschul statistics [102] may 
be used to assess the significance of local pairwise alignments 
between genomes. StatSigMA extends these statistics to multiple 
alignments [103], and StatSigMA-w further extends this technique 
to detect dubiously aligned regions in WGAs of multiple genomes 
[104]. Whereas a given local pairwise alignment may be highly 
significant, the flanks of that alignment may be spurious, and a 
p-value may be computed assessing the possible “over-alignment” 
of a flank [105]. Within a multiple alignment, a number of techni- 
ques have been developed for estimating the accuracy of the align- 
ment of pairs of residues or entire columns, including simply 
computing an alignment of reversed sequences [106], computing 
alignments with bootstrapped guide trees [107], sampling suboptimal
multiple alignments [75], and evaluating consistency within a
library of alternative alignments [108]. 
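
For reference, the Karlin and Altschul statistics mentioned above give the expected number of local alignments with score at least S between sequences of lengths m and n as E = K m n exp(-lambda S). A minimal sketch follows; the values of `lam` and `K` depend on the scoring scheme and base composition, and any values passed in are the caller's assumptions.

```python
import math


def evalue(score, m, n, lam, K):
    """Expected number of chance local alignments scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)


def bit_score(score, lam, K):
    """Normalized score, so that E = m * n * 2 ** (-bit_score)."""
    return (lam * score - math.log(K)) / math.log(2)
```

At the whole-genome scale, the product m n is very large, so an alignment needs a correspondingly high score to be significant. This is one reason that short, weakly conserved matches are hard to distinguish from chance.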


5 Future Challenges 


Despite the substantial progress made in WGA methodology devel- 
opment, there are a number of challenges that remain unsolved. 
First, we are in need of WGA methods that can scale to hundreds or 
thousands of genomes. Along with ever-improving sequencing 
technology, we are accumulating whole-genome sequences at an 
increasing rate. Projects such as the Genome 10K Community of 
Scientists [109], which aims to collect and sequence the genomes 
of 10,000 vertebrate species, will further push the WGA problem to 
new scales. While most WGA algorithms have been made efficient 
for long genomes, very few are practical for large numbers of 
genomes. Encouragingly, we are beginning to see methods capable 
of scaling to thousands of genomes for the simpler task of “core- 
genome alignment” of highly similar microbial-sized genomes 
[110]. However, methods scaling to thousands of genomes for 
the full WGA task or for mammalian-sized genomes do not cur- 
rently exist. In addition to algorithmic advances, we will also be in 
need of novel approaches for storing and representing WGAs of 
thousands of genomes. 

Second, advances are needed in the parameterization of WGA 
methods. Current methods are littered with large numbers of 
parameters that are often heuristic in nature and not easily deter- 
mined. In some cases, the default parameters for a WGA method 
may be markedly suboptimal [111]. One solution to this problem is 
to adopt probabilistic models, which offer principled approaches to 
parameter estimation, such as maximum likelihood. In fact, proba- 
bilistic models of sequence evolution have already been adopted for 
the alignment of colinear genomic segments and have been shown 
to offer improved accuracy [47, 62]. However, we have yet to see a 
method that integrates probabilistic models of both small- and 
large-scale changes that is capable of constructing an entire WGA, 
although the recently introduced “split-alignment” pairwise WGA 
method is a promising step in this direction [112]. In addition, 
most WGA methods use models or scoring schemes that assume 
homogeneous rates of evolution across the genome. This assump- 
tion is obviously violated in real data, and new methods will need to 
be developed that take this into account. Simulated noncoding 
genomic alignments that represent a heterogeneous mix of evolu- 
tionary rates have been developed and should be useful for the 
development of new WGA methodology [97]. 

Lastly, more attention must be paid to the fact that a WGA is 
typically just a single estimate of the evolutionary history of a set of 
genomes and portions of this estimate may be highly uncertain. 
Encouragingly, methods for colinear genomic alignment have 
shed light on this issue at the nucleotide level [62, 113]. How- 
ever, the issue of uncertainty at the large-scale orthology map level 
has not been sufficiently studied, perhaps due to the lack of proba- 
bilistic models for that level of the WGA problem. In addition, 
most efforts to address uncertainty in alignments simply assign 
levels of confidence to the components of a single alignment. It 
may be more useful to be presented with a set of near-optimal 
alignments so that alternative evolutionary histories can be exam- 
ined by downstream analyses [114]. The determination and repre- 
sentation of uncertainty for all scales of a WGA will likely remain a 
challenging problem as the number of genomes included in align- 
ments increases. 


6 Exercises 

1. Download the whole-genome aligner MUMmer 
(http://mummer.sourceforge.net) and FASTA-formatted genome sequences for 
Helicobacter pylori strains J99 and B38 from 
GenBank (http://www.ncbi.nlm.nih.gov/genbank/, accessions 
NC_000921 and NC_012973, respectively). Run the NUCmer or 
PROmer programs on the two genome sequences. Visualize the 
resulting alignment with the mummerplot program. How many 
colinear blocks are there in the alignment? How many inversion 
events are implied by the alignment? 

2. Visit the UCSC Genome Browser (http://genome.ucsc.edu) 
and browse the human genome version GRCh38/hg38. 
Search for and view the CFTR gene, mutations in which cause the 
disease cystic fibrosis. Turn on the Net tracks for alignments to 
genomes of non-primate placental mammals by clicking on the 
“Placental Chain/Net” link (in the “Comparative Genomics” sec- 
tion) and choosing the appropriate configuration. Examine the 
Mouse Net track in the visualization and note the color of the 
mouse net alignments. Using the “Chromosome Color Key” 
(located in between the browser visualization and the track config- 
uration section), identify the chromosome on which the mouse 
ortholog of CFTR is located. Looking at the net alignments for 
all of the placental mammals, does it appear that CFTR has been 
conserved across this clade? 

Fig. 3 The evolutionary scenario to be considered for Exercise 3. Each bullet-like shape corresponds to a 
genomic segment, with the direction of the bullet indicating the orientation of the segment 

3. Consider the evolutionary scenario giving rise to the gen- 
omes of three species shown in Fig. 3. For each of the relations 
listed below, give the pairs of genomic segments with that relation. 


(a) Orthology 
(b) Paralogy 
(c) Toporthology 


Inferring Orthology and Paralogy 

Abstract 


The distinction between orthologs and paralogs, genes that started diverging by speciation versus duplica- 
tion, is relevant in a wide range of contexts, most notably phylogenetic tree inference and protein function 
annotation. In this chapter, we provide an overview of the methods used to infer orthology and paralogy. 
We survey both graph-based approaches (and their various grouping strategies) and tree-based approaches, 
which solve the more general problem of gene/species tree reconciliation. We discuss conceptual differ- 
ences among the various orthology inference methods and databases and examine the difficult issue of 
verifying and benchmarking orthology predictions. Finally, we review typical applications of orthologous 
genes, groups, and reconciled trees and conclude with thoughts on future methodological developments. 


Key words Orthology, Paralogy, Tree reconciliation, Orthology benchmarking 


1 Introduction 


The study of genetic material almost always starts with identifying, 
within or across species, homologous regions—regions of common 
ancestry. As we have seen in previous chapters, this can be done at 
the level of genome segments [1], genes [2], or even down to single 
residues, in sequence alignments [3]. Here, we focus on genes as 
evolutionary and functional units. The central premise of this chap- 
ter is that it is useful to distinguish between two classes of homolo- 
gous genes: orthologs, which are pairs of genes that started diverging 
via evolutionary speciation, and paralogs, which are pairs of genes 
that started diverging via gene duplication [4] (Fig. 1, Box 1). 
Originally, the terms and their definition were proposed by Walter 
M. Fitch in the context of species phylogeny inference, i.e., the 
reconstruction of the tree of life. He stated “Phylogenies require 
orthologous, not paralogous, genes” [4]. Indeed, since orthologs 
arise by speciation, any set of genes in which every pair is ortholo- 
gous has by definition the same evolutionary history as the 


Adrian M. Altenhoff and Natasha M. Glover are the Joint first authors 


Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Methods in Molecular Biology, vol. 1910, 
https://doi.org/10.1007/978-1-4939-9074-0_5, © The Author(s) 2019 



Adrian M. Altenhoff et al. 



Fig. 1 (a) Simple evolutionary scenario of a gene family with two speciation 
events (S1 and S2) and one duplication event (star). The types of events completely 
and unambiguously define all pairs of orthologs and paralogs: The frog gene is 
orthologous to all other genes (they coalesce at S1). The red genes are orthologous 
to each other, as are the blue genes (they coalesce at S2), but red and blue genes 
are paralogous to each other (they coalesce at the star). (b) The corresponding 
orthology graph. The genes 
are represented here by vertices and orthology relationships by edges. The frog 
gene forms one-to-many orthology with both the human and dog genes, because 
it is orthologous to more than one sequence in each of these organisms. In such 
cases, the bi-directional best-hit approach only recovers one of the relations 
(the highest scoring one). Note that, in contrast to BBH, the nonsymmetric BeTs 
approach (simply taking the best genome-wide hit for each gene regardless of 
reciprocity) would, in the situation of a lost blue human gene, infer an incorrect 
orthologous relation between the blue dog and red human gene 


underlying species. These days, however, the most frequent moti- 
vation for the orthology/paralogy distinction is to study and pre- 
dict gene function: it is generally believed that orthologs—because 
they were the same gene in the last common ancestor of the species 
involved—are likely to have similar biological function. By contrast, 
paralogs—because they result from duplicated genes that have been 
retained, at least partly, over the course of evolution—are believed 
to often differ in function. Consequently, orthologs are of interest 
to infer function computationally, while paralogs are commonly 
used to study function innovation. 


Box 1: Terminology 

Homology is a relation between a pair of genes that share a 
common ancestor. All pairs of genes in the below figure are 
homologous to each other. 

Orthology is a relation defined over a pair of homologous 
genes, where the two genes have emerged through a specia- 
tion event [4]. Example pairs of orthologs are (x1, y1) or (x2, 
z1). Orthologs can be further subclassified into one-to-one, 
one-to-many, many-to-one, and many-to-many orthologs. 
The qualifiers one and many indicate for each of the two 
involved genes whether they underwent an additional dupli- 
cation after the speciation between the two genomes. Hence, 
the gene pair (x1, y1) is an example of a one-to-one ortholo- 
gous pair, whereas (x2, z1) is a many-to-one ortholog relation. 

Paralogy is a relation defined over a pair of homologous 
genes that have emerged through a gene duplication, e.g., (x1, 
x2) or (x1, y2). 

In-Paralogy is a relation defined over a triplet. It involves a 
pair of genes and a speciation event of reference. A gene pair is 
an in-paralog if they are paralogs and duplicated after the 
speciation event of reference [5]. The pair (x1, y2) are 
in-paralogs with respect to the speciation event S1. 

Out-Paralogy is also a relation defined over a pair of genes 
and a speciation event of reference. This pair is out-paralogs if 
the duplication event through which they are related to each 
other predates the speciation event of reference. Hence, the 
pair (x1, y2) are out-paralogs with respect to the speciation 
event S2. 

Co-orthology is a relation defined over three genes, where 
two of them are in-paralogs with respect to the speciation 
event associated to the third gene. The two in-paralogous 
genes are said to be co-orthologous to the third (out-group) 
gene. Thus, x1 and y2 are co-orthologs with respect to z1. 

Homoeology is a specific type of homologous relation in a 
polyploid species, which thus contain multiple “sub-gen- 
omes.” This relation describes pairs of genes that originated 
by speciation and were brought back together in the same 
genome by allopolyploidization (hybridization) [6]. Thus, in 
the absence of rearrangement, homoeologs can be thought of 
as orthologs between sub-genomes. 


In this chapter, we first review the main methods used to infer 
orthology and paralogy, including recent techniques for scaling up 
algorithms to big data. We then discuss the problem of benchmark- 
ing orthology inference. In the last main section, we focus on 
various applications of orthology and paralogy. 

2 Inferring Orthology 


2.1 Graph-Based Methods 


2.1.1 Graph Construction Phase: Orthology Inference 



Most orthology inference methods can be classified into two major 
types: graph-based methods and tree-based methods [7]. Methods 
of the first type rely on graphs with genes (or proteins) as nodes and 
evolutionary relationships as edges. They infer whether these edges 
represent orthology or paralogy and build clusters of genes on the 
basis of the graph. Methods of the second type are based on gene/ 
species tree reconciliation, which is the process of annotating all 
splits of a given gene tree as duplication or speciation, given the 
phylogeny of the relevant species. From the reconciled tree, it is 
trivial to derive all pairs of orthologous and paralogous genes: pairs 
of genes that coalesce at a speciation node are orthologs, whereas pairs 
that coalesce at a duplication node are paralogs. In this section, we 
present the concepts and methods associated with the two types 
and discuss the advantages, limitations, and challenges associated 
with them. 
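
As a minimal sketch of how pairs are read off a reconciled tree, the following Python fragment derives all ortholog and paralog pairs from a gene tree whose internal nodes have already been labeled. The nested-tuple encoding of the tree, the gene names (chosen to mirror the scenario of Fig. 1), and the function names are illustrative assumptions and do not correspond to any published tool.

```python
from itertools import product

# A leaf is a gene name (str); an internal node is (event, left, right),
# where event is "speciation" or "duplication". Hypothetical encoding.
FIG1_TREE = (
    "speciation",                                   # S1
    "frog",
    ("duplication",                                 # star
     ("speciation", "human_red", "dog_red"),        # S2
     ("speciation", "human_blue", "dog_blue")),     # S2
)

def leaves(node):
    """Return the list of gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def classify_pairs(node, orthologs=None, paralogs=None):
    """Label each gene pair by the event at which the two genes coalesce."""
    if orthologs is None:
        orthologs, paralogs = set(), set()
    if isinstance(node, str):
        return orthologs, paralogs
    event, left, right = node
    target = orthologs if event == "speciation" else paralogs
    # Genes on opposite sides of this node coalesce exactly here.
    for a, b in product(leaves(left), leaves(right)):
        target.add(frozenset((a, b)))
    classify_pairs(left, orthologs, paralogs)
    classify_pairs(right, orthologs, paralogs)
    return orthologs, paralogs

orthologs, paralogs = classify_pairs(FIG1_TREE)
```

On this toy tree, the frog gene is orthologous to all four mammalian genes, while every red/blue pair is paralogous.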


Graph-based approaches were originally motivated by the availabil- 
ity of complete genome sequences and the need for efficient meth- 
ods to detect orthology. They typically run in two phases: a graph 
construction phase, in which pairs of orthologous genes are 
inferred (implicitly or explicitly) and connected by edges, and a 
clustering phase, in which groups of orthologous genes are con- 
structed based on the structure of the graph. 


In its most basic form, the graph construction phase identifies 
orthologous genes by considering pairs of genomes at a time. The 
main idea is that between any given two genomes, the orthologs 
tend to be the homologs that diverged least. Why? Because assum- 
ing that speciation and duplication are the only types of branching 
events, the orthologs branched by definition at the latest possible 
time point—the speciation between the two genomes in question. 
Therefore, using the sequence similarity score as a surrogate measure of 
closeness, the basic approach identifies the corresponding ortholog 
of each gene through its genome-wide best hit (BeT)—the highest 
scoring match in the other genome [8]. To make the inference 
symmetric (as orthology is a symmetric relation), it is usually 
required that BeTs be reciprocal, i.e., that orthology be inferred 
for a pair of genes g1 and g2 if and only if g2 is the BeT of g1 and g1 is 
the BeT of g2 [9]. This symmetric variant, referred to as bi-direc- 
tional best hit (BBH), also has the merit of being more robust 
against a possible gene loss in one of the two lineages (Fig. 1). 
Inferring orthology from BBH is computationally efficient, 
because each genome pair can be processed independently and 
high-scoring alignments can be computed efficiently using dynamic 
programming [10] or heuristics such as BLAST [11]. Overall, the 
time complexity scales quadratically in terms of the total number of 
genes (Box 2). Furthermore, the implementation of this kind of 
algorithm is simple. 
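
To illustrate how little code the basic procedure requires, here is a minimal Python sketch of BeT and BBH inference for one genome pair. It assumes that similarity scores have already been computed (e.g., by an aligner) and are supplied as dictionaries; the gene names and scores are invented for illustration and reproduce the gene-loss situation described for Fig. 1, in which the blue human gene has been lost.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring gene in the other genome.

    `scores` maps (query, target) pairs to similarity scores.
    Ties are broken in favor of the first pair encountered.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}

def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return gene pairs (a, b) that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical scores: the blue human gene has been lost.
human_vs_dog = {("human_red", "dog_red"): 950, ("human_red", "dog_blue"): 400}
dog_vs_human = {("dog_red", "human_red"): 950, ("dog_blue", "human_red"): 400}

bets_from_dog = best_hits(dog_vs_human)
bbh = bidirectional_best_hits(human_vs_dog, dog_vs_human)
```

Here the one-directional BeT of the blue dog gene is the red human gene (a paralog), whereas the reciprocity requirement of BBH discards that pair.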


Box 2: Computational Considerations for Scaling to Many 
Genomes 

Time complexity—the amount of time for an algorithm to 
run as a function of the input—is an important consideration 
when dealing with big data. This is relevant for inferring 
orthologs and paralogs due to the massive amounts of 
sequence data. Thus, it is necessary to consider the time 
complexity of the inference algorithms, especially when scal- 
ing for large and multiple genomes. In computer science, this 
is commonly denoted in terms of “Big O” notation, which 
expresses the scaling behavior of the algorithm, up to a con- 
stant factor. Below are listed the common time complexities 
for aspects of some orthology inference algorithms, in order 
of most efficient to least efficient. 


Linear time 


• O(n): Optimal algorithm to reconcile a rooted, fully 
resolved gene tree with a species tree [12]; Hieranoid algo- 
rithm, which recursively merges genomes along the spe- 
cies tree to avoid all-against-all computation [13]. 


Quadratic time 


• O(n²): The all-against-all stage central to many orthol- 
ogy algorithms scales quadratically, where n is the total 
number of genes. 


Cubic time 


• O(n³): The COG database’s graph-based clustering 
merges triplets of homologs which share a common face 
until no more can be added. 


NP-complete 

• “Nondeterministic polynomial time” complete: a large class of 
problems for which no polynomial-time algorithm is 
known (the best known exact algorithms scale, e.g., exponentially 
with the input size), so that exact solutions are impractical 
for large inputs. NP-complete pro- 
blems are typically solved approximately, using heuris- 
tics. For instance, maximum likelihood gene tree 
estimation is NP-complete [14]. 


However, orthology inference by BBH has several limitations, 
which motivated the development of various improvements 
(Table 1). 

Allowing for More Than One Ortholog 


Evolutionary Distances 


Differential Gene Losses 


2.1.2 Clustering Phase: From Pairs to Groups 

Some genes can have more than one orthologous counterpart in a 
given genome. This happens whenever a gene undergoes duplica- 
tion after the speciation of the two genomes in question. Since 
BBH only picks the best hit, it only captures part of the ortholo- 
gous relations (Fig. 1). The existence of multiple orthologous 
counterparts is often referred to as one-to-many or many-to-many 
orthology, depending on whether duplication took place in one or 
both lineages. To designate the copies resulting from such duplica- 
tions occurring after a speciation of reference, Remm et al. coined 
the term in-paralogs and introduced a method called InParanoid 
that improves upon BBH by potentially identifying all pairs of 
many-to-many orthologs [5]. In brief, their algorithm identifies 
all paralogs within a species that are evolutionarily closer (more 
similar) to each other than to the BBH gene in the other genome. 
This results in two sets of in-paralogs—one for each species—where 
all pairwise combinations between the two sets are orthologous 
relations. Alternatively, it is possible to identify many-to-many 
orthology by relaxing the notion of “best hit” to “group of best 
hits.” This can be implemented using a score tolerance threshold or 
a confidence interval around the BBH [23, 34]. 
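
The idea of expanding a BBH pair with in-paralogs can be sketched as follows. This is a deliberately simplified illustration inspired by the description above, not the actual InParanoid implementation (which, among other things, also assigns confidence values). We assume that a seed BBH pair with its score is given, together with within-genome similarity scores; all names and numbers are hypothetical.

```python
def inparalogs(seed, seed_score, within_scores):
    """Genes of the same genome that are more similar to `seed`
    than `seed` is to its BBH partner in the other genome."""
    members = {seed}
    for (g1, g2), score in within_scores.items():
        if score > seed_score:
            if g1 == seed:
                members.add(g2)
            elif g2 == seed:
                members.add(g1)
    return members

def expand_bbh(bbh_pair, bbh_score, within_a, within_b):
    """All pairwise combinations of the two in-paralog sets."""
    a, b = bbh_pair
    group_a = inparalogs(a, bbh_score, within_a)
    group_b = inparalogs(b, bbh_score, within_b)
    return sorted((x, y) for x in group_a for y in group_b)

# Hypothetical frog/human comparison: the human gene duplicated
# after the frog/human speciation.
within_frog = {}
within_human = {("human_red", "human_blue"): 700}

pairs = expand_bbh(("frog", "human_red"), 500, within_frog, within_human)
```

Starting from the single BBH pair (frog, human_red), the sketch recovers the one-to-many relation between the frog gene and both human copies.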


Instead of using sequence similarity as a surrogate for evolutionary 
distance to identify the closest gene(s), Wall et al. proposed to use 
direct and proper maximum likelihood estimates of the evolution- 
ary distance between pairs of sequences [31]. This estimate of 
evolutionary distance is based on the number and type of amino 
acid substitutions between the two sequences. Indeed, previous 
studies have shown that the highest scoring alignment is often 
not the nearest phylogenetic neighbor [35]. Building upon this 
work, Roth et al. showed how statistical uncertainties in the dis- 
tance estimation can be incorporated into the inference 
strategy [36]. 


As discussed above, one of the advantages of BBH over BeT is that 
by virtue of the bi-directional requirement, the former is more 
robust to gene losses in one of the two lineages. But if gene losses 
occurred along both lineages, it can happen that a pair of genes 
mutually closest to one another are in fact paralogs, simply because 
both their corresponding orthologs were lost—a situation referred 
to as “differential gene losses.” Dessimoz et al. [37] presented a 
way to detect some of these cases by looking for a third species in 
which the corresponding orthologs have not been lost and thus can 
act as witnesses of non-orthology. 


The graph construction phase yields orthologous relationships 
between pairs of genes. But this is often not sufficient. Concep- 
tually, information obtained from multiple genes or organisms is 
often more powerful than that obtained from pairwise comparisons 
only. In particular, as the use of a third genome as potential witness 
of non-orthology suggests, a more global view can allow identifica- 
tion and correction of inconsistent/spurious predictions. Practi- 
cally, it is more intuitive and convenient to work with groups of 
genes than with a list of gene pairs. Therefore, it is often desirable to 
cluster orthologous genes into groups. 

Tatusov et al. [8] introduced the concept of clusters of ortho- 
logous groups (COGs). COGs are computed by using triangles 
(triplets of genes connected to each other) as seeds and then 
merging triangles which share a common face, until no more 
triangles can be added. This clustering can be computed relatively 
efficiently in time O(n³), where n is the number of genomes analyzed 
[38]. The stated objective of this clustering procedure is to group 
genes that have diverged from a single gene in the last common 
ancestor of the species represented [8]. Practically, they have been 
found to be useful by many, most notably to categorize prokaryotic 
genes into broad functional categories. 
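
The triangle-merging idea can be sketched in a few lines of Python. This is a brute-force illustration under simplifying assumptions: the input is an undirected graph of best-hit relations given as gene pairs, the requirement that the three genes of a triangle come from different genomes is ignored, and triangles are enumerated naively. It is not the COG implementation; the gene names are hypothetical.

```python
from itertools import combinations

def cog_like_clusters(edges):
    """Seed with triangles, then merge triangles that share a side."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Seed step: every triplet of mutually connected genes is a triangle.
    triangles = [
        frozenset(t) for t in combinations(sorted(neighbors), 3)
        if all(y in neighbors[x] for x, y in combinations(t, 2))
    ]

    # Merge step: union-find over triangles, joined through shared sides.
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_with_side = {}
    for i, tri in enumerate(triangles):
        for side in combinations(sorted(tri), 2):
            j = first_with_side.setdefault(side, i)
            parent[find(i)] = find(j)

    clusters = {}
    for i, tri in enumerate(triangles):
        clusters.setdefault(find(i), set()).update(tri)
    return sorted(sorted(c) for c in clusters.values())

edges = [("A1", "B1"), ("A1", "C1"), ("B1", "C1"),   # triangle 1
         ("B1", "D1"), ("C1", "D1"),                 # triangle 2 shares B1-C1
         ("D1", "X1"),                               # lone edge, no triangle
         ("X1", "Y1"), ("X1", "Z1"), ("Y1", "Z1")]   # separate triangle
clusters = cog_like_clusters(edges)
```

The two triangles sharing the side B1-C1 are merged into one group, whereas the single edge D1-X1 is not sufficient to join the two groups.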

A different clustering approach was adopted by OrthoMCL, 
another well-established graph-based orthology inference method 
[29]. There, groups of orthologs are identified by Markov Cluster- 
ing [39]. In essence, the method consists in simulating a random 
walk on the orthology graph, where the edges are weighted accord- 
ing to similarity scores. The Markov Clustering process gives rise to 
probabilities that two genes belong to the same cluster. The graph 
is then partitioned according to these probabilities and members of 
each partition form an orthologous group. These groups contain 
orthologs and “recent” paralogous genes, where the recency of the 
paralogs can be somewhat controlled through the parameters of the 
clustering process. 
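
For readers unfamiliar with Markov Clustering, the following pure-Python sketch shows its two alternating steps, expansion (squaring the column-stochastic matrix, i.e., taking two random-walk steps) and inflation (raising entries to a power and renormalizing, which strengthens strong flows). It is a generic toy version, not the OrthoMCL pipeline; the weight matrix, the inflation value, the fixed iteration count, and the threshold used to read out clusters are all assumptions made for illustration.

```python
def normalize_columns(m):
    """Scale every column so that it sums to one."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def expand(m):
    """Expansion step: matrix product of m with itself."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, power):
    """Inflation step: element-wise power followed by renormalization."""
    return normalize_columns([[x ** power for x in row] for row in m])

def markov_cluster(weights, inflation=2.0, iterations=50):
    n = len(weights)
    # Add self-loops, then turn edge weights into transition probabilities.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Each row with remaining mass is an attractor; the columns it
    # attracts form one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)

# Two tightly connected triplets of genes joined by one weak edge (2-3).
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = markov_cluster(W)
```

A larger inflation value tends to produce smaller, tighter clusters, which is one way the "recency" of included paralogs can be influenced.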

A third grouping strategy consists in building groups by iden- 
tifying fully connected subgraphs (called “cliques” in graph theory) 
[23]. This approach has the merits of straightforward interpreta- 
tion (groups of genes which are all orthologous to one another) 
and high confidence in terms of orthology within the resulting 
groups, due to the high consistency required to form a fully 
connected subgraph. But it has the drawbacks of being hard to 
compute (clique finding belongs to the NP-complete class of pro- 
blems, for which no polynomial time algorithm is known; see Box 
2) and being excessively conservative for many applications. 
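
To make the clique-based strategy concrete, the sketch below enumerates maximal cliques with the basic Bron-Kerbosch recursion (a standard exact algorithm with exponential worst-case running time) on the orthology graph of Fig. 1b. The gene names and the dictionary encoding of the graph are illustrative assumptions, and the code is not taken from any orthology pipeline.

```python
def maximal_cliques(neighbors):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    found = []

    def extend(clique, candidates, excluded):
        if not candidates and not excluded:
            found.append(sorted(clique))      # nothing can be added: maximal
            return
        for v in sorted(candidates):
            extend(clique | {v},
                   candidates & neighbors[v],
                   excluded & neighbors[v])
            candidates = candidates - {v}     # v has been fully explored
            excluded = excluded | {v}

    extend(set(), set(neighbors), set())
    return sorted(found)

# Orthology graph of Fig. 1b (edges = orthologous relations).
fig1_graph = {
    "frog": {"human_red", "human_blue", "dog_red", "dog_blue"},
    "human_red": {"frog", "dog_red"},
    "dog_red": {"frog", "human_red"},
    "human_blue": {"frog", "dog_blue"},
    "dog_blue": {"frog", "human_blue"},
}
cliques = maximal_cliques(fig1_graph)
```

The frog gene belongs to both maximal cliques, so any partition of the genes into disjoint fully connected groups must leave out some of its orthologous relations.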

As emerges from these various strategies, there is more than 
one way orthologous groups can be defined, each with different 
implications in terms of group properties and applications [40]. In 
fact, there is an inherent trade-off in partitioning the orthology 
graph into clusters of genes, because orthology is a non-transitive 
relation: if genes A and B are orthologs and genes B and C are 
orthologs, genes A and C are not necessarily orthologs, e.g., con- 
sider in Fig. 1 the blue human gene, the frog gene, and the red dog 
gene. Therefore, if groups are defined as sets of genes in which all 
pairs of genes are orthologs (as with OMA groups), it is not 
possible to partition A, B, and C into groups capturing all ortho- 
logous relations while leaving out all paralogous relations. 


2.1.3 Hierarchical Clustering 


2.2 Tree-Based Methods 


More inclusive grouping strategies necessarily lead to orthologs 
and paralogs within the same group. Nevertheless, it can be possi- 
ble to control the nature of the paralogs included. For instance, as 
seen above, OrthoMCL attempts to include only “recent” para- 
logs in its groups. This idea can be specified more precisely by 
defining groups with respect to a particular speciation event of 
interest, e.g., the base of the mammals. Such hierarchical groups 
are expected to include orthologs and in-paralogs with respect to 
the reference speciation—in our example all copies that have des- 
cended from a single common ancestor gene in the last mammalian 
common ancestor. Conceptually, hierarchical orthologous groups 
can be defined as groups of genes that have descended from a single 
common ancestral gene within a taxonomic range of interest. 

Several resources provide hierarchical clustering of orthologous 
groups. EggNOG [15] and OrthoDB [25], for example, both 
implement this concept by applying a COG-like clustering method 
for various taxonomic ranges. Another example, Hieranoid, pro- 
duces hierarchical groups by using a guide tree to perform pairwise 
orthology inferences at each node from the leaves to the root— 
inferring ancestral genomes at each node in the tree [13, 18]. Simi- 
larly, OMA GETHOGs is an approach based on an orthology graph 
of pairwise orthologous gene relations, in which hierarchical ortho- 
logous groups are formed starting with the most specific taxonomic 
level and are incrementally merged toward the root [21, 22]. Another 
method, COCO-CL, identifies hierarchical orthologous groups 
recursively, using correlations of similarity scores among homolo- 
gous genes [41] and, interestingly, without relying on a species 
tree. By capturing part of the gene tree structure in the group 
hierarchies, these methods try in some way to bridge the gap 
between graph-based and tree-based orthology inference 
approaches. We now turn our attention to the latter. 


At their core, tree-based methods infer orthologs on the basis of 
gene family trees whose internal nodes are labeled as speciation or 
duplication nodes. Indeed, once all nodes of the gene tree have 
been inferred as a speciation or duplication event, it is trivial to 
establish whether a pair of genes is orthologous or paralogous, 
based on the type of the branching where they coalesce. Such 
labeling is traditionally obtained by reconciling gene and species 
trees. In most cases, gene and species trees have different topolo- 
gies, due to evolutionary events acting specifically on genes such as 
duplications, losses, lateral transfers, or incomplete lineage sorting 
[42]. Goodman et al. [43] pioneered research to resolve these 
incongruences. They showed how the incongruences can be 
explained in terms of speciation, duplication, and loss events on 
the gene tree (Fig. 2) and provided an algorithm to infer such 
events. 

Fig. 2 Schematic example of the gene/species tree reconciliation. The gene tree 
and species tree are not compatible. Reconciliation methods resolve the 
incongruence between the two by inferring speciation, duplication, and loss 
events on the gene tree. The reconciled tree indicates the most parsimonious 
history of this gene, constrained to the species tree. The simple representation 
(bottom right) suggests that the human and frog genes are orthologs and that 
they are both paralogous to the dog gene 

Most tree reconciliation methods rely on a parsimony criterion: 
the preferred reconciliation is the one which requires the smallest 
number of gene duplications and losses. This makes it possible to 
compute reconciliation efficiently and is tenable as long as duplica- 
tion and loss events are rare compared to speciation events. In their 
seminal article, Goodman et al. [43] had already devised their 
reconciliation algorithm under a parsimony strategy. In the 
subsequent years, the problem was formalized in terms of a mapping 
function between the gene and species trees [44], whose computa- 
tional cost was conjectured [45], and later proved [12, 46] to 
coincide with the number of gene duplications and losses. These 
results yielded highly efficient algorithms, either in terms of asymp- 
totic time complexity [12] or in terms of runtimes on typical 
problem sizes [47]. With these near-optimal solutions, one might 
think that the tree reconciliation problem has long been solved. As 
we shall see in the rest of this section, however, the original formu- 
lation of the tree reconciliation problem has several limitations in 
practice, which have stimulated the development of various refine- 
ments to overcome them (Table 2). 
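
The mapping between gene and species trees can be sketched with the standard lowest-common-ancestor (LCA) rule: each gene-tree node is mapped to the LCA of the species below it, and a node is labeled as a duplication if it maps to the same species-tree node as one of its children. The Python fragment below applies this rule to the scenario of Fig. 2. It is a naive illustration (the LCA lookup is not the linear-time algorithm of [12], and losses are not counted); the tree encodings and the internal node names "mammals" and "tetrapods" are assumptions.

```python
def ancestors(species, parent):
    """Path from a species-tree node up to the root (inclusive)."""
    path = []
    while species is not None:
        path.append(species)
        species = parent[species]
    return path

def lca(a, b, parent):
    """Lowest common ancestor of two species-tree nodes."""
    seen = set(ancestors(a, parent))
    return next(node for node in ancestors(b, parent) if node in seen)

def reconcile(gene_node, species_of, parent, events):
    """Map a gene-tree node onto the species tree; record event types
    of internal nodes in post-order."""
    if isinstance(gene_node, str):                 # leaf: a gene name
        return species_of[gene_node]
    left, right = gene_node
    map_left = reconcile(left, species_of, parent, events)
    map_right = reconcile(right, species_of, parent, events)
    mapped = lca(map_left, map_right, parent)
    kind = "duplication" if mapped in (map_left, map_right) else "speciation"
    events.append((mapped, kind))
    return mapped

# Species tree ((human, dog), frog), given as child -> parent links.
parent = {"human": "mammals", "dog": "mammals",
          "mammals": "tetrapods", "frog": "tetrapods", "tetrapods": None}
species_of = {"human_gene": "human", "dog_gene": "dog", "frog_gene": "frog"}

# Incongruent gene tree ((human, frog), dog), as in Fig. 2.
gene_tree = (("human_gene", "frog_gene"), "dog_gene")

events = []
root_mapping = reconcile(gene_tree, species_of, parent, events)
```

The human/frog split is labeled as a speciation and the root as a duplication, so the human and frog genes come out as orthologs that are both paralogous to the dog gene.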
2.2.1 Unresolved Species Tree 

2.2.2 Rooting 


A first problem ignored by most early reconciliation algorithms lies 
in the uncertainty often associated with the species tree, which 
these methods assume as correct and heavily rely upon. 

One way of dealing with the uncertainties is to treat unresolved 
parts of the species tree as multifurcating nodes (also known as soft 
polytomies). By doing so, the reconciliation algorithm is not forced 
to choose a specific type of evolutionary event in ambiguous 
regions of the tree. This approach is, for instance, implemented in 
TreeBeST [52] and used in the Ensembl Compara project [53]. 

Alternatively, van der Heijden et al. [57] demonstrated that it is often 
possible to infer speciation and duplication events on a gene tree 
without knowledge of the species tree. Their approach, which they 
call species overlap, identifies for a given split the species represented 
in the two subtrees induced by the split. If at least one species has 
genes in both subtrees, a duplication event is inferred; otherwise, a 
speciation event is inferred. In fact, this approach is a special case of soft 
polytomies where all internal nodes have been collapsed. Thus, the 
only information needed for this approach is a rooted gene tree. 
Since then, this approach has been adopted in other projects, such 
as PhylomeDB [59]. 
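
The species overlap rule is simple enough to be written down directly. The sketch below labels every internal node of a rooted gene tree, using the gene family of Fig. 1 as input; the nested-tuple tree encoding, the gene names, and the function names are illustrative assumptions rather than the interface of any existing tool.

```python
def species_below(node, species_of):
    """Set of species with at least one gene below this node."""
    if isinstance(node, str):
        return {species_of[node]}
    left, right = node
    return species_below(left, species_of) | species_below(right, species_of)

def label_by_species_overlap(node, species_of):
    """Annotate each split as duplication (shared species) or speciation."""
    if isinstance(node, str):
        return node
    left, right = node
    shared = species_below(left, species_of) & species_below(right, species_of)
    event = "duplication" if shared else "speciation"
    return (event,
            label_by_species_overlap(left, species_of),
            label_by_species_overlap(right, species_of))

species_of = {"frog": "frog",
              "human_red": "human", "human_blue": "human",
              "dog_red": "dog", "dog_blue": "dog"}

# Rooted, unlabeled gene tree of the family shown in Fig. 1.
gene_tree = ("frog", (("human_red", "dog_red"), ("human_blue", "dog_blue")))

labeled = label_by_species_overlap(gene_tree, species_of)
```

Only the node separating the red and blue subtrees has species (human and dog) on both sides and is therefore labeled as a duplication.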


The classical reconciliation formulation requires both gene and 
species trees to be rooted. But most models of sequence evolution 
are time reversible and thus do not allow the root of the 
reconstructed gene tree to be inferred. One sensible solution is to root a gene tree 
so that it minimizes the number of duplication events [62]. Thus, 
this method uses the parsimony principle for both rooting and 
reconciliation. For cases of multiple optimal rootings, ties can be 
broken by selecting the tree that minimizes the tree height [63] or 
by picking the rooting which minimizes the number of gene 
losses [61]. 

Another approach is to place the root at the “center of the 
tree”—also known as “midpoint rooting” [58]. The idea of this 
method goes back to Farris [64] and is motivated by the concept of 
a molecular clock. But for most gene families, assuming a constant 
rate of evolution is inappropriate [65, 66], and thus this approach is 
not used widely. A newly introduced refinement based on minimiz- 
ing average deviations among children nodes holds promise of 
being more robust [67] but still relies on a molecular clock 
assumption. 
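
A small sketch may help to see why midpoint rooting depends on the molecular clock. The Python code below finds the two most distant leaves of an unrooted tree with branch lengths and places the root halfway along the path between them. The adjacency-dictionary encoding, the node names, and the branch lengths are invented for illustration.

```python
def farthest(tree, start):
    """Return (node, distance, path) for the node farthest from `start`."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, prev, dist, path = stack.pop()
        if dist > best[1]:
            best = (node, dist, path)
        for nxt, length in tree[node].items():
            if nxt != prev:                      # do not walk back
                stack.append((nxt, node, dist + length, path + [nxt]))
    return best

def midpoint_root(tree):
    """Return the branch (u, v) carrying the midpoint of the longest
    leaf-to-leaf path, and the distance of the midpoint from u."""
    any_node = next(iter(tree))
    end_a, _, _ = farthest(tree, any_node)       # one end of the longest path
    _, diameter, path = farthest(tree, end_a)    # the other end and the path
    half, walked = diameter / 2, 0.0
    for u, v in zip(path, path[1:]):
        if walked + tree[u][v] >= half:
            return (u, v), half - walked
        walked += tree[u][v]

# Unrooted tree ((A, B), (C, D)) in which gene C evolves much faster.
tree = {
    "A": {"n1": 1.0}, "B": {"n1": 2.0},
    "C": {"n2": 4.0}, "D": {"n2": 1.0},
    "n1": {"A": 1.0, "B": 2.0, "n2": 1.0},
    "n2": {"C": 4.0, "D": 1.0, "n1": 1.0},
}
branch, offset = midpoint_root(tree)
```

In this example the root is placed on the long branch leading to C, simply because C evolved faster. Unless C truly is the earliest-diverging lineage, this placement is wrong, which illustrates the kind of error that arises when rates are not constant.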

For the species tree, the most common and reliable way of 
rooting trees is by identifying an outgroup species. PhylomeDB 
uses genes from outgroup species to root gene trees [59]. One 
main potential problem with this approach is that in many situa- 
tions, it can be difficult to identify a suitable outgroup. For exam- 
ple, in analyses covering all kingdoms of life, an outgroup species 
may not be available, or the relevant genes might have been lost 
[68]. A suitable outgroup needs to be close enough to allow for 
reliable sequence alignment, yet it must have speciated clearly 
before any other species separated. Furthermore, ancient duplica- 
tions can cause outgroup species to carry ingroup genes. These 
difficulties make this approach more challenging for automated, 
large-scale analysis [69]. 


2.2.3 Gene Tree Uncertainty 


2.2.4 Parsimony vs. Likelihood 


Another assumption made in the original tree reconciliation prob- 
lem is the (topological) correctness of the gene tree. But it has been 
shown that this assumption is commonly violated, often due to 
finite sequence lengths, taxon sampling [70, 71], or gene evolution 
model violations [72]. On the other hand, techniques for expressing 
uncertainties in gene tree reconstruction via support measures, e.g., 
bootstrap values, have become well established. Storm and Sonn- 
hammer [58] as well as Zmasek and Eddy [63] independently 
suggested extending the bootstrap procedure to reconciliation, 
thereby reducing the dependency of the reconciliation procedure 
on any one gene tree while providing a measure of support of the 
inferred speciation/duplication events. The downsides of using the 
bootstrap are the high computational costs and interpretation dif- 
ficulties associated with it [73]. 

Similar to how an unresolved species tree can be handled, unre- 
solved parts of the gene tree can also be collapsed into multifurcat- 
ing nodes. For instance, HOGENOM [55] and Softparsmap [61] 
collapse branches with low bootstrap support values. 

A third way of tackling this problem consists in simultaneously 
solving both the gene tree reconstruction and reconciliation pro- 
blems [74]. This approach uses the parsimony criterion of minimizing the 
number of duplication events to improve on the gene tree itself. 
This is achieved by rearranging the local gene tree topology of 
regions with low bootstrap support such that the number of dupli- 
cations and losses is further reduced. 


All the approaches mentioned so far try to minimize the number of 
gene duplication events. This is generally justified by a parsimony 
argument, which assumes that gene duplications and losses are rare 
events. But what if this assumption is frequently violated? Little is 
known about duplication and loss rates in general [75], but there is 
strong evidence for historical periods with high gene duplication 
occurrence rates [76] or gene families specifically prone to massive 
duplications (e.g., olfactory receptor, opsins, serine/threonine 
kinases, etc.).

Motivated by this reasoning, Arvestad et al. introduced the idea 
of a probabilistic model for tree reconciliation [49]. They used a 
Bayesian approach to estimate the posterior probabilities of a rec- 
onciliation between a given gene and species tree using Markov 
chain Monte Carlo (MCMC) techniques. Arvestad et al. [49]
modeled gene duplication and loss events through a birth-death
process [77]. In the subsequent years, they refined their method to
also model sequence evolution and substitution rates in a unified
framework called gene sequence evolution model with iid rates (GSR)
[49, 50].

Perhaps the biggest problem with the probabilistic approach is
that it is not clear how well the assumptions of the model (the
birth-death process with fixed parameters) relate to the true process
of gene duplication and gene loss. Doyon et al. [78] compared the
maximum parsimony reconciliation trees from 1278 fungi gene
families to the probabilistically reconciled trees using gene
birth/death rates fitted from the data. They found that in all but two
cases, the maximum parsimony scenario corresponds to the most
probable one. This remarkably high level of consistency indicates
that in terms of the accuracy of the “best” reconciliation, there is
little to gain from using a likelihood approach over the parsimony
criterion of minimizing the number of duplication events. But how
this result generalizes to other datasets has yet to be investigated.


2.3 Graph-Based vs. Tree-Based: Which Is Better?


Given the two fundamentally different paradigms in orthology 
inference that we reviewed in this section, one can wonder which 
is better. Conceptually, tree reconciliation methods have several 
advantages. In terms of inference, by considering all sequences 
from all species at the same time, it can also be expected that they 
can extract more information from the sequences. This in turn 
should translate into higher statistical power. In terms of their 
output, reconciled gene trees provide the user more information 
than pairs or groups of orthologs. For example, the trees display the 
order of duplication and speciation events, as well as evolutionary 
distances between these events. In practice, however, these meth- 
ods have the disadvantage of having much higher computational 
complexity than their graph-based counterparts. Furthermore, the 
two approaches are in practice often not that strictly separated. 
Tree-based methods often start with a graph-based clustering step 
to identify families of homologous genes. Conversely, several hier- 
archical grouping algorithms also rely on species trees in their 
inference. 

Thus, it is difficult to make general statements about the rela- 
tive performance of the two classes of inference methods. One 
solution that can leverage the unique abilities of both tree-based 
and graph-based methods is to combine several independent 
orthology inference methods into one. We discuss this technique 
in the next section. 


3 Meta-methods


In recent years a new class of orthology inference tools has emerged
which attempts to make the most out of multiple orthology predic- 
tion algorithms—meta-methods. These are approaches which com- 
bine several individual and distinct methods in order to produce 
more robust orthology predictions. These meta-methods are able 
to take advantage of the standardized formats of output which has 
been a goal of the orthology community [79], as well as the many 
new and well-established methods out there. 

Generally, meta-methods assign a confidence score to a given 
predicted orthologous relation. In its most basic form, more weight 
is given to orthologs predicted by the most methods. Some exam- 
ples include methods which simply take the intersection of several 
methods, such as GET_HOMOLOGUES [80], COMPARE [81], 
HCOP [82], and DIOPT [83]. These methods maintain a high 
level of precision, but since they are based on intersections, they 
necessarily have a lower recall. 
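
The basic scoring idea can be sketched in a few lines. In this minimal example, each method contributes a set of predicted ortholog pairs, the confidence of a pair is the fraction of methods that predict it, and the strict consensus is the intersection of all sets. The method names and gene identifiers are made up for illustration, and the scoring does not reproduce any of the tools cited above.

```python
from collections import Counter
from itertools import chain


def pair(a, b):
    """Order-independent representation of a predicted ortholog pair."""
    return tuple(sorted((a, b)))


# Hypothetical predictions of three methods between two genomes.
predictions = {
    "method_1": {pair("hsa_G1", "mmu_G1"), pair("hsa_G2", "mmu_G2")},
    "method_2": {pair("hsa_G1", "mmu_G1"), pair("hsa_G2", "mmu_G2"),
                 pair("hsa_G3", "mmu_G3")},
    "method_3": {pair("hsa_G1", "mmu_G1"), pair("hsa_G3", "mmu_G4")},
}


def vote_scores(predictions):
    """Confidence of each pair = fraction of methods that predict it."""
    counts = Counter(chain.from_iterable(predictions.values()))
    n_methods = len(predictions)
    return {p: c / n_methods for p, c in counts.items()}


scores = vote_scores(predictions)
consensus = set.intersection(*predictions.values())

for p, s in sorted(scores.items(), key=lambda item: -item[1]):
    print(p, round(s, 2))
print("strict consensus:", consensus)
```

Only one of the four candidate pairs survives the strict intersection, which illustrates why intersection-based meta-methods gain precision at the cost of recall.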

Additionally, post-processing techniques can be used to build 
upon the base of orthologs found by several methods—thus assign- 
ing more sequences as orthologs and improving performance. For 
example, MOSAIC (Multiple Orthologous Sequence Analysis and 
Integration by Cluster optimization) [84] uses an iterative graph- 
based optimization approach that works on ortholog sets predicted 
by several independent methods. MOSAIC captures orthologs 
which are missed by some individual methods, producing a 1.6- 
fold increase in the number of orthologs detected. Another exam- 
ple is the MARIO software, which looks for the intersection of 
several different orthology methods as seed groups and then pro- 
gressively adds unassigned proteins to the groups based on HMM 
profiles [85]. MetaPhOrs’ approach integrates phylogenetic and 
homology information derived from different databases 
[86]. They demonstrate that the number of independent sources 
from which an orthology prediction is made, as well as the level of 
consistency across predictions, can be used as confidence scores. 

The meta-methods mentioned so far combine independent
orthology prediction algorithms and assign a higher score
the more algorithms predict a given orthologous
relation. However, another emerging approach is to use machine
learning techniques to recognize patterns among several different 
orthology inference methods. With this, one can predict previously 
unknown high-confidence orthologs. WORMHOLE is a tool 
which uses the information from 17 different orthology prediction 
methods to train support vector machine classifiers for predicting 
least diverged orthologs [87]. WORMHOLE was able to strongly 
re-predict least diverged orthologs in the reference set and also 
predict previously unclassified orthologous genes.

The type of meta-approach and its associated stringency
depend on what the user is after. For example, if the goal
is to get very-high-confidence groups, methods which only take
the intersection without trying to add more orthologs may
be preferable. Studies requiring both high precision and recall may
be better suited to use the meta-methods which use post- 
processing or machine learning to predict orthologs. And as with 
all methods, it is important to understand which clades the method 
has been benchmarked in and which orthology tools have been 
combined. For example, if several methods have the same bias, one 
will just propagate the bias and end up with a false sense of security 
because the methods are not independent. 


4 Scaling to Many Genomes 


In terms of orthology inference, the abundance of genomes now 
available has resulted in an emphasis on driving down computa- 
tional processing time via efficient algorithms. When inferring 
orthology for many genomes, the bottleneck is generally the all- 
against-all computations—aligning the proteins in every genome 
against the proteins in every other genome. This is the first step of 
nearly all graph-based methods. The all-against-all computation has 
an O(n^2) runtime, meaning it scales quadratically with the number
of genomes analyzed (Box 2). 

So far, two main techniques for scaling orthology prediction to 
many genomes have emerged. The first approach is by making the 
all-against-all comparisons faster. Because comparisons are inde- 
pendent of each other, the most obvious way of doing this is by 
taking advantage of a high-performance computing cluster, as this 
is an embarrassingly parallel computing problem. Many methods 
have implemented this, such as Hieranoid [13], PorthoMCL [88], 
or OMA [22]. Another way to save time on the all-against-all 
comparisons is by using very fast algorithms for the homology 
search. For example, preliminary results of SonicParanoid showed
a 160- to 750-fold speedup of orthology inference compared to
InParanoid [89]. Innovations in alignment algorithms with methods such
as DIAMOND [90] or MMseqs2 [91] have the potential to greatly
reduce the time to do the all-against-all comparisons.

A second approach to efficiently scale up orthology inference to 
many genomes is by simply avoiding doing the entire all-against-all 
comparisons. This makes sense, since a significant amount of time is 
spent comparing unrelated gene pairs. For example, it is possible to 
avoid aligning many unrelated pairs by exploiting the transitive 
property of homology. Wittwer et al. [92] did this by first building 
clusters of homologous sequences with one representative 
sequence per cluster and subsequently performing the all-against-all
within each cluster. Hieranoid avoids unnecessary all-against-all
comparisons by using a species tree as a guide, reducing the number
of comparisons to N - 1 for N genomes, scaling linearly rather than
quadratically [18]. Another way to avoid all-by-all comparison is by
using a mapping strategy, whereby new proteomes are mapped 
onto precomputed orthologous groups. This strategy has been 
successfully implemented with the eggNOG database—each 
sequence in a new proteome is mapped to a precomputed ortholo- 
gous cluster based on hidden Markov models. Then, orthology 
relations and function are transferred to the new sequence from 
the best matching sequence in the database [93]. 
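
To make the scaling argument of this section concrete, the short sketch below counts genome-versus-genome comparisons for an all-against-all design, N(N - 1)/2, and for a species-tree-guided design, N - 1. The function names are ours, and the count deliberately ignores that the individual comparisons in the two designs do not cost the same (in the tree-guided case, each comparison involves the merged content of whole subtrees).

```python
def all_against_all_comparisons(n_genomes: int) -> int:
    """Number of unordered genome pairs: N * (N - 1) / 2."""
    return n_genomes * (n_genomes - 1) // 2


def tree_guided_comparisons(n_genomes: int) -> int:
    """One comparison per internal node of a binary species tree: N - 1."""
    return max(n_genomes - 1, 0)


for n in (10, 100, 1000):
    print(n, all_against_all_comparisons(n), tree_guided_comparisons(n))
# 10 45 9
# 100 4950 99
# 1000 499500 999
```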


5 Benchmarking Orthology 


Assessing the quality of orthology predictions is important but
difficult. The main challenge is that the precise evolutionary history
of entire genomes is largely unknown and thus, predictions can
only be validated indirectly, using surrogate measures. To be
informative, such measures need to strongly correlate with
orthology/paralogy. At the same time, they should be independent from the
methods used in the orthology inference process. Concretely, this
means that the orthology inference is not based on the surrogate
measure and the surrogate measure is not derived from
orthology/paralogy.


5.1 Benchmarking Approaches


Several ways of benchmarking orthology inference have been
developed in the past years. In the next sections, we go over the main
approaches, bringing attention to the advantages and limitations
of each.


5.1.1 Functional Conservation


The first surrogate measures proposed revolved around conserva- 
tion of function [94]. This was motivated by the common belief 
that orthologs tend to have conserved function, while paralogs 
tend to have different functions. Indeed, orthologs tend to be 
more conserved than paralogs in terms of GO annotation similarity 
[95]. Thus, “for a given evolutionary distance, more accurate 
orthology inference is likely to be correlated with more functionally 
similar gene pairs.” Hulsen et al. [94] assessed the quality of ortho- 
log predictions in terms of conservation of co-expression levels, 
domain annotation, and protein-protein interaction partners. 
Additionally, Altenhoff et al. [96] used similarity of experimentally 
validated GO annotations as well as Enzyme Commission 
(EC) numbers as a functional benchmark. Functional benchmarks 
have an advantage in that many researchers are interested in orthol- 
ogy because they want to find functionally conserved genes, thus 
making functional tests important for assessing different inference 
methods. The main limitation of these measures is that it is not so 
clear how much they correlate with orthology/paralogy. Indeed, it
has been argued that the difference in function conservation trends
between orthologs and paralogs might be much smaller than
commonly assumed and indeed many examples are known of orthologs
that have dramatically different functions [97].


5.1.2 Gene Neighborhood Conservation


The fraction of orthologs that have neighboring genes being
orthologs themselves is an indicator of consistency and therefore to some
extent also of quality of orthology predictions [94]. Although
synteny has been used as part of the orthology inference for several
algorithms, to date it has not been used as part of large-scale
benchmarking efforts. One possible problem is that gene
neighborhood can be conserved among paralogs, such as those resulting
from whole-genome duplications. Furthermore, some methods use
gene neighborhood conservation to help in their inference process,
which can bias the assessment done on such measures (principle of
independence stated above).


5.1.3 Species Tree Discordance Test


The quality of ortholog predictions can also be assessed based on
phylogeny. By definition, the tree relating a set of genes all
orthologous to one another only contains speciation splits and has the
same topology as the underlying species tree. We introduced a
benchmarking protocol that quantifies how well the predictions from
various orthology inference methods agree with undisputed species
tree topologies [96, 98]. Thus, the species tree discordance test
judges the accuracy of ortholog predictions based on the
correctness of the species tree which can be constructed from them.
The advantage of this measure is that by virtue of directly ensuing
from the definition of orthology, it correlates strongly with it and
thus satisfies the first principle. However, the second principle,
independence from the inference process, is not satisfied with
methods relying on the species tree (typically all reconciliation
methods but also most graph-based methods producing
hierarchical groups). In such cases, interpretation of the results must be done
carefully.


5.1.4 Gold Standard Gene Tree Test


High-quality reference gene trees can also be used to assess 
orthology inferences. For this, one compares the pairs of ortho- 
logs from a given method to pairs of orthologs derived from these 
expertly curated gene trees [40, 99]. One drawback of this bench- 
mark is that it is limited by the ability to curate the phylogeny—if 
the evolutionary history of the gene family is ambiguous, the 
resulting reference tree will unavoidably have mistakes. Another 
limitation is the small size of most benchmarks of this type. This 
casts doubts on their generalizability and makes them prone to 
overfitting. 
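
The comparison against reference pairs amounts to a precision and recall computation, sketched below. The gene identifiers and both pair sets are invented for illustration. In a real assessment, predictions involving genes absent from the curated trees would have to be filtered out first, which this sketch assumes has already been done.

```python
def precision_recall(predicted, reference):
    """Precision and recall of predicted ortholog pairs against reference pairs."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Pairs are stored as frozensets so that (a, b) and (b, a) are identical.
reference = {frozenset(p) for p in
             [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]}
predicted = {frozenset(p) for p in
             [("a1", "b1"), ("b2", "a2"), ("a3", "b9")]}

precision, recall = precision_recall(predicted, reference)
print(round(precision, 3), recall)  # 0.667 0.5
```

With only four reference pairs, a single prediction shifts precision by a third, which is a small-scale picture of why benchmarks of this size generalize poorly.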


5.1.5 Subtree Consistency Test


For inference methods based on reconciliation between gene and
species trees, Vilella et al. [53] proposed a different
phylogeny-based assessment scheme. For any duplication node of the labeled
gene tree, a consistency score is computed, which captures the
balance of the species found in the two subtrees. Unbalanced
nodes correspond to an evolutionary scenario involving extensive
gene losses and therefore, under the principle of parsimony, are less
likely to be correct. Given that studies to date tend to support the
adequacy of the parsimony criterion in the context of gene family
dynamics (Subheading 2.2.4), it can be expected that this metric
correlates highly with correct orthology/paralogy assignments.
However, since virtually all tree-based methods themselves
incorporate this very criterion in their objective function (i.e.,
minimizing the number of gene duplications and losses), the principle of
independence is violated, and thus the adequacy of this measure is
questionable.


5.1.6 Latent Class Analysis


Chen et al. [100] proposed a purely statistical benchmark based on
latent class analysis (LCA). Given the absence of a definitive answer
on whether two given genes are orthologs, the authors argue that
by looking at the agreement and disagreement of predictions made
by several inference methods on a common dataset, one can
estimate the reliability of individual predictors. More precisely, LCA is
a statistical technique that computes maximum likelihood estimates
of sensitivity and specificity rates for each orthology inference
method, given their predictions and given an error model. This is
attractive, because it does not depend on any surrogate measure.
However, the results depend on the error model assumed. Thus, we
are of the opinion that LCA merely shifts the problem of assessing
orthology to the problem of assessing an error model of various
orthology inference methods.


5.1.7 Simulated Genomes


Finally, simulated data can be used in benchmarking. With simulated
genomes, predictions can be validated against the precise
evolutionary history, in terms
of gene duplication, insertion, deletion, and lateral gene transfer
[101]. Knowing for certain all aspects of the simulated genomes
gives an advantage over assessments based on empirical data, where
the true evolutionary history is unknown. On the other hand, how
well the simulated data reflect “real” data is debatable.


5.2 Orthology Benchmarking Service


The orthology benchmarking service is a web-based platform to
which users can upload their ortholog predictions and run them
through a variety of benchmarks. The user must use the Quest for
Orthologs (QFO) reference proteome set, which is a set of 66
genomes that covers a diverse set of species across all domains [79], to
infer pairwise or groups of orthologs. Several phylogenetic and 
function-based benchmarks are automatically run on the uploaded 
data, and then summary statistics of the results of each benchmark
are reported. The user can compare their method’s performance
with that of other well-known orthology inference algorithms and
choose to make theirs public as well. For each benchmark, a
precision-recall curve is reported, allowing for ease of comparison
and evaluation of individual inference techniques. Because of the
range of benchmarking tests and publicly available methods for
comparison, the benchmarking service is useful both for users,
who can check which methods work well for their particular
problem, and for method developers. The orthology benchmarking
service can be accessed at http://orthology.benchmarkservice.org.


5.3 Conclusions on Benchmarking


Overall, it becomes apparent that there is no “magic bullet”
strategy for orthology benchmarking, as each approach discussed here
has its limitations (though some limitations are more serious than
others). Nevertheless, comparative studies based on these various
benchmarking measures have reported surprisingly consistent
findings [40, 94, 96, 98, 100]: these assessments generally observe that
there is a trade-off between accuracy and coverage and most
common databases are situated on a Pareto frontier. The various
assessments concur that the “best” orthology approach is highly
dependent on the various possible applications of orthology.


6 Applications


As we have seen so far, there is a large diversity in the methods for 
orthology inference. The main reason is that, although the meth- 
ods discussed here all infer orthology as part of their process, many 
of them have been developed for different reasons and have differ- 
ent ultimate goals. Unfortunately, this is often not mentioned 
explicitly and tends to be a source of confusion. In this section, 
we review some of these ultimate goals and discuss which methods 
and representation of orthology are better suited to address them 
and why. 

As mentioned in the introduction, most interest in orthology
is in the context of function prediction and is largely based on the 
belief that orthologs tend to have conserved function. A conserva- 
tive approach consists in propagating function between one-to-one 
orthologs, i.e., pairs of orthologous genes that have not undergone 
gene duplication since they diverged from one another. Several 
orthology databases directly provide one-to-one orthology predic- 
tions. But even with those that do not, it might still be possible to 
obtain such predictions, for instance, by selecting hierarchical 
groups containing at most one sequence in each species or by 
extracting from reconciled trees the subtrees with no duplication. A
more sophisticated approach consists in propagating gene function 
annotations across genomes on the basis of the full reconciled gene 
tree. Thomas et al. [102], for instance, proposed a way to assign
gene function to uncharacterized proteins using a gene tree and a
hidden Markov model (HMM) among gene families. Engelhardt
et al. [103] developed a Bayesian model of function change along
reconciled gene trees and showed that their approach significantly 
improves upon several methods based on pairwise gene function 
propagation. Ensembl Compara [53] and Panther [102] are two 
major databases providing reconciled gene trees. 
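
The conservative strategy, selecting hierarchical groups that contain at most one sequence per species, can be sketched as a simple filter. The group names, species, and gene identifiers below are hypothetical, and the representation of a group as a list of (species, gene) tuples is an assumption of this example rather than the format of any particular database.

```python
from collections import Counter


def is_single_copy(group):
    """True if the group contains at most one gene per species."""
    counts = Counter(species for species, _gene in group)
    return all(count == 1 for count in counts.values())


# Hypothetical hierarchical orthologous groups.
groups = {
    "HOG1": [("human", "h1"), ("mouse", "m1"), ("zebrafish", "z1")],
    "HOG2": [("human", "h2"), ("mouse", "m2a"), ("mouse", "m2b")],
}

single_copy = {name: g for name, g in groups.items() if is_single_copy(g)}
print(list(single_copy))  # ['HOG1']
```

Any two genes drawn from a retained group can then be treated as candidate one-to-one orthologs for function propagation, whereas HOG2 is set aside because of the duplication in mouse.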

Since Darwin, one traditional question in biology has always 
been how species are related to each other. As we recall in the 
introduction of this chapter, Fitch’s original motivation for defin- 
ing orthology was phylogenetic inference. Indeed, the gene tree 
reconstructed from a set of genes which are all orthologous to each 
other should by definition be congruent to the species tree. OMA 
Groups (OMA) have this characteristic and, crucially, are con- 
structed without help of a species tree. 

Yet another application associated with orthology is general
alignment between genomes, e.g., protein-protein interaction
(PPI) network alignments or whole-genome alignments. Finding 
an optimal PPI network alignment between two genomes on the 
basis of the network topology alone is a computationally hard 
problem (i.e., it is an instance of the subgraph isomorphism prob- 
lem which is NP-complete [104]). Orthology is often used as 
a heuristic to constrain the mapping of the corresponding genes
between the two networks and thus to reduce the problem com- 
plexity of aligning networks [105]. For whole-genome alignments, 
people most often use homologous regions and use orthologs as 
anchor points [106]. These types of application typically rely on 
ortholog predictions between pairs of genomes, as provided, e.g., 
by InParanoid [5] or OMA [23]. 


7 Conclusions and Outlook 


The distinction between orthologs and paralogs is at the heart of 
many comparative genomic studies and applications. The original 
and generally accepted definition of orthology is based on the 
evolutionary history of pairs of genes. By contrast, there is a con- 
siderable diversity in how groups of orthologs are defined. These 
differences largely stem from the fact that orthology is a 
non-transitive relation and therefore, dividing genes into ortholo- 
gous groups will either miss or wrongly include orthologous rela- 
tions. This makes it important and worthwhile to identify the type 
of orthologous group best suited for a given application. 
Regarding inference methods, while most approaches can be
ordered into two fundamental paradigms (graph-based and
tree-based), the difference between the two is shrinking, with
graph-based methods increasingly striving to capture more of the
evolutionary history. On the other hand, the rapid pace at which new
genomes are sequenced limits the applicability of tree-based
methods, which are computationally more demanding.

Benchmarking this large variety of methods remains a hard
problem, from a conceptual point of view as described above but also
because of very practical challenges such as heterogeneous data
formats, genome versions, or gene identifiers. This has been
recognized by the research community and has led to the development of
the QFO consortium benchmarking service [96].

Looking forward, we see potential in extending the current
model of gene evolution, which is limited to speciation,
duplication, and loss events. Indeed, nature is often much more
complicated. For instance, lateral gene transfer (LGT) is believed to be a
major mode of evolution in prokaryotes. While there have been
several attempts at extending tree reconciliation algorithms to
detecting LGT [107, 108], the problem is largely unaddressed in
typical orthology resources [109]. Another relevant evolutionary
process omitted by most methods is whole-genome duplication
(WGD). Even though WGD events act jointly on all gene families,
with few exceptions [110, 111], most methods consider each gene
family independently.

Overall, the orthology/paralogy dichotomy has proved to be
useful but also inherently limited. Reducing the whole evolutionary
history of homologous genes into binary pairwise relations is
bound to be a simplification, and at times an oversimplification.
The shift toward hierarchical orthologous groups is thus a
promising step toward capturing more features of the evolutionary
history of genes. Yet further development will still be needed, as we
are nowhere close to grasping the formidable complexity of gene
evolution across the full diversity of life.


8 Exercises


Assume the following evolutionary scenario 


[Figure: gene tree with leaves A, B, C, D, E, and F]


where duplications are depicted as stars and all other splits are
speciations.


Problem #1: Draw the corresponding orthology graph, where the 
vertices correspond to the observed genes and the edges indi- 
cate orthologous relations between them. 


Problem #2: Apply the following two clustering methods on your 
orthology graph. First, reconstruct all the maximal fully


Abstract


Most genomes are populated by hundreds of thousands of sequences that originated from mobile elements. On
the one hand, these sequences present a real challenge in the process of genome analysis and annotation. On
the other hand, they are very interesting biological subjects involved in many cellular processes. Here we
present an overview of transposable element biodiversity, and we discuss different approaches to
transposable element detection and analysis.

analysis, Genome evolution


1 Introduction 


Most eukaryotic genomes contain large numbers of repetitive 
sequences. This phenomenon was described by Waring and Britten 
a half century ago using reassociation studies [1, 2]. It turned out 
that most of these repetitive sequences originated in transposable 
elements (TEs) [3], though the repetitive fraction of a genome 
varies significantly between different organisms, from 12% in Cae- 
norhabditis elegans [4] to 50% in mammals [3], and more than 80% 
in some plants [5]. With such large contributions to genome 
sequences, it is not surprising that TEs have a significant influence 
on the genome organization and evolution. Although much prog- 
ress has been achieved in understanding the role TEs play in a host 
genome, we are still far from a comprehensive picture of the
delicate evolutionary interplay between a host genome and the
invaders. TEs also pose various challenges to the genomic
community, including aspects related to their detection and
classification, genome assembly and annotation, genome comparisons, and
mapping of genomic variants. Here we present an
overview of TE diversity and discuss major techniques used in their
analyses.


2 Discovery of Mobile Elements 


Transposable elements were discovered by Barbara McClintock 
during experiments conducted in 1944 on maize. Since they 
appeared to influence phenotypic traits, she named them 
controlling elements. However, her discovery was met with less 
than enthusiastic reception by the genetic community. Her
presentation at the 1951 Cold Spring Harbor Symposium was not
understood, or at least not very well received [6]. She had no better luck
with her follow-up publications [7-9] and after several years of 
frustration decided not to publish on the subject for the next two 
decades. Not for the first time in the history of science, an
unappreciated discovery was brought back to life after some other
discovery had been made. In this case it was the discovery of
insertion sequences (IS) in bacteria by the Szybalski group in the early
1970s [10]. In the original paper they wrote: “Genetic elements 
were found in higher organisms which appear to be readily trans- 
posed from one to another site in the genome. Such elements, 
identifiable by their controlling functions, were described by 
McClintock in maize. It is possible that they might be somehow 
analogous to the presently studied IS insertions” [10]. The
importance of McClintock’s original work was eventually appreciated by
the genetic community with numerous awards, including 14
honorary doctoral degrees and a Nobel Prize in 1983 “for her discovery
of mobile genetic elements”
(http://nobelprize.org/nobel_prizes/medicine/laureates/1983/).

Coincidentally, at the same time as Szybalski “rediscovered” TEs,
Susumu Ohno popularized the term junk DNA, which influenced
the genomic field for decades [11], although the term itself had
already been used before [12, 13]. Ohno referred to the so-called noncoding
sequences or, to be more precise, to any piece of DNA that does not
code for a protein, which included all genomic pieces originating in
transposons. The unfavorable picture of transposable and
transposed elements started to change in the early 1990s when some
researchers noticed the evolutionary value of these elements
[14, 15]. With the wheel of fortune turning full circle and advances
in genome sciences, TE research is again focused on the role
mobile elements played in the evolution of gene regulation
[16-23].


1 The historical background of the “junk DNA” term was recently discussed by Dan Graur in his excellent blog
http://judgestarling.tumblr.com/post/64504735261/the-origin-of-the-term-junk-dna-a-historical


3 Transposons Classification


3.1 Insertion Sequences and Other Bacterial Transposons


The bacterial genome is composed of a core genomic backbone 
decorated with a variety of multifarious functional elements. These 
include mobile genetic elements (MGEs) such as bacteriophages, 
conjugative transposons, integrons, unit transposons, composite trans- 
posons, and insertion sequences (IS). Here we elaborate upon the last 
class of these elements as they are most widely found and 
described [24]. 

The ISs were identified during studies of model genetic systems 
by virtue of their capacity to generate mutations as a result of their 
translocation [10]. In-depth studies in antibiotic resistance and 
transmissible plasmids revealed an important role for these mobile 
elements in formation of resistance genes and promoting gene 
capture. In particular, it was observed that several different ele- 
ments were often clustered in “islands” within plasmid genomes 
and served to promote plasmid integration and excision. 

Although these elements sometimes generate beneficial muta- 
tions, they may be considered genomic parasites as ISs code only for 
the enzyme required for their own transposition [24]. While an IS 
element occupies a chromosomal location, it is inherited along with 
its host’s native genes, so its fitness is closely tied to that of its host. 
Consequently, ISs causing deleterious mutations that disrupt a 
genomic mode or function are quickly eliminated from the popula- 
tion. However, intergenically placed ISs have a higher chance to be 
fixed in the population as they are likely neutral regarding the
population’s fitness [25].

ISs are generally compact (Fig. 1). They usually carry no other
functions than those involved in their mobility. These elements
contain recombinationally active sequences that define the
boundaries of the element, together with a transposase (Tpase), the
enzyme that processes these ends and whose gene usually spans
the entire length of the element [26]. The majority of ISs exhibit
short terminal inverted-repeat sequences (IRs) of 10-40 bp.
Several notable exceptions exist, for example, the IS91, IS110, and
IS200/605 families.
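
The terminal inverted repeats described above can be looked for with a simple end-to-end comparison. The following is a minimal sketch, not a published tool: the function names, the 40 bp search limit (taken from the 10-40 bp range quoted above), and the mismatch tolerance are assumptions made for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def terminal_ir_length(element: str, max_len: int = 40, max_mismatches: int = 0) -> int:
    """Length of the terminal inverted repeat of `element`, up to `max_len` bases.

    The left end is compared, base by base, with the reverse complement of
    the right end. Extension stops once more than `max_mismatches`
    mismatches have been seen; the reported repeat always ends on a match.
    """
    element = element.upper()
    max_len = min(max_len, len(element) // 2)
    if max_len == 0:
        return 0
    right_rc = reverse_complement(element[-max_len:])
    best = 0
    mismatches = 0
    for i in range(max_len):
        if element[i] == right_rc[i]:
            best = i + 1
        else:
            mismatches += 1
            if mismatches > max_mismatches:
                break
    return best


# Hypothetical element: an 8 bp IR flanking a 10 bp internal region.
toy_element = "GGTCTAAC" + "AAAAAAAAAA" + "GTTAGACC"
ir_len = terminal_ir_length(toy_element)
```

A result of zero does not by itself rule out an IS, since families such as IS91 and IS110 are listed above as exceptions.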

The IRs contain two functional domains [27]. One is involved
in Tpase binding; the other is involved in the cleavage and
strand-transfer reactions that result in transposition. IS promoters
are often positioned partially within the IR sequence upstream of the
Tpase gene. Binding sites for host-specific proteins are often located
in proximity to the terminal IRs and play a role in modulating
transposition activity or Tpase expression [28]. A general pattern for
the functional organization of Tpases has emerged from the limited
number analyzed. The N-terminal region contains the sequence-specific
DNA binding activities of the proteins, while the catalytic
domain is often localized toward the C-terminal end [28].

Fig. 1 Schematic representation of insertion sequences (IS). DR direct repeats, IR
inverted repeats, ORF open reading frame


Another common feature of ISs is duplication of a target site
that results in short direct repeats (DRs) flanking the IS [29]. The
length of the direct repeat varies from 2 to 14 base pairs and is a
hallmark of a given element. Homologous recombination between
two IS elements can result in each having two different DRs [30].

ISs have been classified on the basis of (1) similarities in genetic
organization (arrangement of open reading frames); (2) marked
identities or similarities in their Tpases (common domains or
motifs); (3) similar features of their ends (terminal IRs); and
(4) fate of the nucleotide sequence of their target sites (generation
of a direct target duplication of determined length). Based on the
above rules, ISs are currently classified in 30 families (Table 1) [31].


3.2 Eukaryotic Transposable Elements


The first TE classification system was proposed by Finnegan in
1989 [32] and distinguished two classes of TEs characterized by
their transposition intermediate: RNA (class I or retrotransposons)
or DNA (class II or DNA transposons). The transposition
mechanism of class I is commonly called “copy and paste” and that of
class II, “cut and paste.” In 2007 Wicker et al. [33] proposed a
hierarchical classification based on TE structural characteristics and
mode of replication (see Table 2 and Fig. 2). Below we present a brief
overview of eukaryotic mobile elements that in general follows
this classification.


3.2.1 Class I: Mobile Elements


As mentioned above, class I TEs transpose through an RNA inter- 
mediary. The RNA intermediate is transcribed from genomic DNA 
and then reverse-transcribed into DNA by a TE-encoded reverse 
transcriptase (RT), followed by reintegration into a genome. Each 
replication cycle produces one new copy, and as a result, class I 
elements are the major contributors to the repetitive fraction in 
large genomes. Retrotransposons are divided into five orders: LTR 
retrotransposons, DIRS-like elements, Penelope-like elements 
(PLEs), LINEs (long interspersed elements), and SINEs (short 
interspersed elements). This scheme is based on the mechanistic 
features, organization, and reverse transcriptase phylogeny of these 
retroelements. Occasionally, the reverse transcriptase encoded by an
autonomous TE can reverse-transcribe another RNA present in
the cell, e.g., an mRNA, and produce a retrocopy of it, which in
most cases results in a pseudogene.

Table 1
Prokaryotic transposable elements as presented in the IS Finder database [31]

Family        Typical size range in bp   Direct repeat size in bp   IRs?   Number of ORFs
IS1           740-4600                   0-10                       Y      1 or 2
IS110         1200-1550                  0                          Y      1
IS1182        1330-1950                  0-60                       Y      1
IS1380        1550-2000                  4-5                        Y      1
IS1595        700-7900                   8                          Y      1
IS1634        1500-2000                  5-6                        Y      1
IS200/IS605   600-2000                   0                          Y/N    1 or 2
IS21          1750-2600                  4-8                        Y      2
IS256         1200-1500                  8-9                        Y      1
IS3           1150-1750                  5                          Y      2
IS30          1000-1700                  2-3                        Y      1
IS4           1150-5400                  8-13                       Y      1 or more
IS481         950-1300                   4-15                       Y      1
IS5           800-1500                   2-9                        Y      1 or 2
IS6           700-900                    8                          Y      1
IS607         1700-2500                  0                          N      2
IS630         1000-1400                  2                          Y      1 or 2
IS66          1350-3000                  8-9                        Y      1 or more
IS701         1400-1550                  4                          Y      1
IS91          1500-2000                  0                          N      1
IS982         1000                       3-9                        Y      1
ISAs1         1200-1500                  8-10                       Y      1
ISAzo13       1250-2200                  0-4                        Y      1
ISH3          1225-1500                  4-5                        Y      1
ISH6          1450                       8                          Y      1
ISKra4        1400-2900                  0-9                        Y      1 or more
ISL3          1300-2300                  8                          Y      1
ISLre2        1500-2000                  9                          Y      1
Tn3           Over 3000                  0                          Y      More than 1
ISNCY         1300-2400                  0-12                       Y/N    1 or 2

IRs?: presence (Y) or absence (N) of terminal inverted repeats


Table 2
Classification of eukaryotic transposable elements as proposed by Wicker et al. [33]

Class                                    Order         Superfamily     Phylogenetic distribution
Class I (retrotransposons)               LTR           Copia           Plants, metazoans, fungi
                                                       Gypsy           Plants, metazoans, fungi
                                                       Bel-Pao         Metazoans
                                                       Retrovirus      Metazoans
                                                       ERV             Metazoans
                                         DIRS          DIRS            Plants, metazoans, fungi
                                                       Ngaro           Metazoans, fungi
                                                       VIPER           Trypanosomes
                                         PLE           Penelope        Plants, metazoans, fungi
                                         LINE          R2              Metazoans
                                                       RTE             Metazoans
                                                       Jockey          Metazoans
                                                       L1              Plants, metazoans, fungi
                                         SINE          tRNA            Plants, metazoans, fungi
                                                       7SL             Plants, metazoans, fungi
                                                       5S              Metazoans
                                         SVA*                          Primates
                                         Retrogenes*                   Plants, metazoans, fungi
Class II (DNA transposons), subclass 1   TIR           Tc1-Mariner     Plants, metazoans, fungi
                                                       hAT             Plants, metazoans, fungi
                                                       Mutator         Plants, metazoans, fungi
                                                       Merlin          Metazoans
                                                       Transib         Metazoans, fungi
                                                       P               Plants, metazoans
                                                       PiggyBac        Metazoans
                                                       PIF-Harbinger   Plants, metazoans, fungi
                                                       CACTA           Plants, metazoans, fungi
                                         Crypton       Crypton         Fungi
Class II (DNA transposons), subclass 2   Helitron      Helitron        Plants, metazoans, fungi
                                         Maverick      Maverick        Metazoans, fungi

Please note that SVAs and retrogenes are not included in that classification
*Not included in the original Wicker classification


The LTR retrotransposons are characterized by the presence of
long terminal repeats (LTRs) ranging from several hundred to
several thousand base pairs. Both exogenous retroviruses and
LTR retrotransposons contain a gag gene that encodes a viral
particle coat and a pol gene that encodes a reverse transcriptase,
ribonuclease H, and an integrase, which provide the enzymatic
machinery for reverse transcription and integration into the host
genome. Reverse transcription occurs within the viral or viral-like
particle (GAG) in the cytoplasm, and it is a multistep process
[34]. Unlike LTR retrotransposons, exogenous retroviruses contain
an env gene, which encodes an envelope that facilitates their
migration to other cells. Some LTR retrotransposons may contain
remnants of an env gene, but their insertion capabilities are limited
to the originating genome [35]. This would suggest that
they originated from exogenous retroviruses by losing the env gene.
However, there is evidence that suggests the contrary, given that
LTR retrotransposons can acquire the env gene and become
infectious entities [36]. Presently, most of the LTR sequences (85%) in
the human genome are found only as isolated LTRs, with the
internal sequence having been lost, most likely due to homologous
recombination between flanking LTRs [37]. Interestingly, LTR
retrotransposons target their reinsertion to specific genomic sites,
often around genes, with putative important functional implications
for a host gene [35]. Lander et al. estimated that 450,000
LTR copies make up about 8% of our genome [38]. LTR
retrotransposons inhabiting large genomes, such as maize, wheat, or
barley, can contain thousands of families. However, despite the
diversity, very few families comprise most of the repetitive fraction
in these large genomes. Notable examples are Angela (wheat) [39],
BARE1 (barley) [40], Opie (maize) [41], and Retrosor6
(sorghum) [42].

Fig. 2 Structures of eukaryotic mobile elements. See text for detailed discussion

The DIRS order groups a structurally diverse set of transposons
that possess a tyrosine recombinase (YR) gene instead of an
integrase (INT) and do not form target site duplications (TSDs).
Their termini resemble either split direct repeats (SDR) or inverted
repeats. Such features indicate a different integration mechanism
than that of other class I mobile elements. DIRS were discovered in
the slime mold (Dictyostelium discoideum) genome in the early
1980s [43], and they are present in all major phylogenetic lineages
including vertebrates [44]. It has been shown that they are also
common in hydrothermal vent organisms [45].

Another order, termed Penelope-like elements (PLE), has a wide,
though patchy, distribution from amoebae and fungi to vertebrates,
with copy numbers of up to thousands per genome [46]. Interestingly,
no PLE sequences have been found in mammalian genomes, and 
apparently they were lost from the genome of C. elegans 
[47]. Although PLEs with an intact ORF have been found in 
several genomes, including Ciona and Danio, the only transcrip- 
tionally active representative, Penelope, is known from Drosophila 
virilis. It causes the hybrid dysgenesis syndrome characterized by 
simultaneous mobilization of several unrelated TE families in the 
progeny of dysgenic crosses. It seems that Penelope invaded 
D. virilis quite recently, and its invasive potential was demonstrated 
in D. melanogaster [46]. PLEs harbor a single ORF that codes for a 
protein containing reverse transcriptase (RT) and endonuclease 
(EN) domains. The PLE RT domain more closely resembles telo- 
merase than the RT from LTRs or LINEs. The EN domain is 
related to GIY-YIG intron-encoded endonucleases. Some PLE 
members also have LTR-like sequences, which can be in a direct 
or an inverse orientation, and have a functional intron [46]. 

LINEs [48, 49] do not have LTRs; however, they have a poly-A
tail at the 3’ end and are flanked by TSDs. They comprise about
21% of the human genome, and among them L1, with about
850,000 copies, is the most abundant and best-described LINE
family. L1 is the only LINE retroposon still active in the human
genome [50]. In the human genome, there are two other LINE-like
repeats, L2 and L3, distantly related to L1. A contrasting
situation has been noticed in the malaria mosquito Anopheles
gambiae, where around 100 divergent LINE families compose only 3%
of its genome [51]. LINEs in plants, e.g., Cin4 in maize and Ta11
in Arabidopsis thaliana, seem rare as compared with LTR
retrotransposons. A full copy of mammalian L1 is about 6 kb long and
contains a PolII promoter and two ORFs. ORF1 codes for a
non-sequence-specific RNA binding protein that contains zinc
finger, leucine zipper, and coiled-coil motifs. ORF1p functions as a
chaperone for the L1 mRNA [52, 53]. The second ORF encodes
an endonuclease, which makes a single-stranded nick in the
genomic DNA, and a reverse transcriptase, which uses the nicked DNA
to prime reverse transcription of LINE RNA from the 3’ end.
Reverse transcription is often unfinished, leaving behind
fragmented copies of LINE elements; hence most of the L1-derived
repeats are short, with an average size of 900 bp. LINEs are part of
the CR1 clade, which has members in various metazoan species,
including fruit fly, mosquito, zebrafish, pufferfish, turtle, and chicken
[54]. Because they encode their own retrotransposition machinery,
LINE elements are regarded as autonomous retrotransposons.

SINEs [48, 49] evolved from RNA genes, such as 7SL and
tRNA genes. By definition, they are short, up to 1000 base pairs
long. They do not encode their own retrotranscription machinery
and are considered nonautonomous elements; in most cases they
are mobilized by the L1 machinery [55]. The outstanding member
of this class in the human genome is the Alu repeat, which
contains a cleavage site for the AluI restriction enzyme that gave
it its name [56]. With over a million copies in the human genome,
Alu is probably the most successful transposon in the history of life.
Primate-specific Alu and its rodent relative B1 have a limited
phylogenetic distribution, suggesting their relatively recent origins. The
mammalian-wide interspersed repeats (MIRs), by contrast, spread
before the eutherian radiation, and their copies can be found in
different mammalian groups including marsupials and monotremes
[57]. SVA elements are unique primate elements due to their
composite structure. They are named after their main components:
SINE, VNTR (a variable number of tandem repeats), and Alu
[58]. Usually, they contain the hallmarks of retroposition, i.e.,
they are flanked by TSDs and terminated by a poly(A) tail. It seems
that SVA elements are nonautonomous retrotransposons mobilized
by the L1 machinery, and they are thought to be transcribed by RNA
polymerase II. SVAs are transpositionally active and are responsible
for some human diseases [59]. They originated less than 25 million
years ago, and they form the youngest retrotransposon family with
about 3000 copies in the human genome [58].

Retro(pseudo)genes are a special group of retroposed
sequences, which are products of reverse transcription of a spliced
(mature) mRNA. Hence, their characteristic features are an absence
of promoter sequence and introns, the presence of flanking direct
repeats, and a 3’-end polyadenosine tract [60]. Processed
pseudogenes, as retropseudogenes are sometimes called, have been
generated in vitro at a low frequency in human HeLa cells via mRNA
from a reporter gene [60]. The source of the reverse transcription
machinery in humans and other vertebrates seems to be active L1
elements [61]. However, not all retroposed messages have to end
up as pseudogenes. About 20% of mammalian protein-encoding
genes lack introns in their ORFs [62]. It is conceivable that many
genes lacking introns arose by retroposition. Some genes are known
to be retroposed more often than others. For instance, in the
human genome there are over 2000 retropseudogenes of ribosomal
proteins [63]. A genome-wide study showed that the human
genome harbors about 20,000 pseudogenes, 72% of which most
likely arose through retroposition [64]. Interestingly, the vast
majority (92%) of them are quite recent transpositions that
occurred after the primate/rodent divergence [64]. Some of the
retroposed genes may undergo quite complicated evolutionary paths.
An example could be the RNF13B retrogene, which replaced its
own parental gene in the mammalian genomes. This retrocopy was
duplicated in primates, and the evolution of this primate-specific
copy was accompanied by the exaptation of two TEs, Alu and L1,
and intron gain via changing a part of the coding sequence into an
intron, leading to the origin of a functional, primate-specific
retrogene with two splicing variants [65].


3.2.2 Class II: Mobile Elements


Class II elements move by a conservative cut-and-paste mechanism; 
the excision of the donor element is followed by its reinsertion 
elsewhere in the genome. DNA transposons are abundant in bacte- 
ria, where they are called insertion sequences (see Subheading 3.1), 
but are present in all phyla. Wicker et al. distinguished two sub- 
classes of DNA transposons based on the number of DNA strands 
that are cut during transposition [33]. 

Classical “cut-and-paste” transposons belong to the subclass I, 
and they are classified as the TIR order. They are characterized by 
terminal inverted repeats (TIR) and encode a transposase that binds 
near the inverted repeats and mediates mobility. This process is not 
usually a replicative one, unless the gap caused by excision is 
repaired using the sister chromatid. When inserted at a new loca- 
tion, the transposon is flanked by small gaps, which, when filled by 
host enzymes, cause duplication of the sequence at the target site. 
The length of these TSDs is characteristic for particular
transposons. Nine superfamilies belong to the TIR order, including
Tc1-Mariner, Merlin, Mutator, and PiggyBac. The second order,
Crypton, consists of a single superfamily of the same name. Originally
thought to be limited to fungi [66], Cryptons are now known to have a
wide distribution, including animals and heterokonts [67].
MITEs, a heterogeneous group of small nonautonomous elements,
also belong to the TIR order [68]; in some genomes they have
amplified to thousands of copies, e.g., Stowaway in the rice genome [69],
Tourist in most bamboo genomes [70], or Galluhop in the chicken
genome [71].
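
The origin of TSDs through staggered cleavage and gap filling, as described above, can be illustrated with a toy string model. This is only a sketch under simplifying assumptions: sequences are plain single-strand strings, the gap repair is modeled as a direct copy of the target site, and the function names and parameters are hypothetical.

```python
def insert_with_tsd(target: str, element: str, pos: int, tsd_len: int) -> str:
    """Insert `element` into `target` at `pos`, duplicating the target site.

    A staggered cut of `tsd_len` bases leaves single-stranded gaps on both
    sides of the integrated element; filling them copies the target site,
    so the element ends up flanked by two identical short direct repeats.
    """
    if tsd_len < 0 or not 0 <= pos <= len(target) - tsd_len:
        raise ValueError("target site does not fit inside the target sequence")
    site = target[pos:pos + tsd_len]
    return target[:pos] + site + element + site + target[pos + tsd_len:]


def flanking_tsd_length(seq: str, start: int, end: int, max_len: int = 14) -> int:
    """Longest identical direct repeat flanking the element at seq[start:end]."""
    limit = min(max_len, start, len(seq) - end)
    for k in range(limit, 0, -1):
        if seq[start - k:start] == seq[end:end + k]:
            return k
    return 0


# Hypothetical example: a 3 bp target site (CGT) is duplicated on insertion.
genome = insert_with_tsd("AAACGTTTT", "xxxx", pos=3, tsd_len=3)
tsd = flanking_tsd_length(genome, start=6, end=10)
```

In real data the two copies of a TSD diverge over time, so an exact-match test of this kind would only be expected to work for young insertions.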

Subclass II includes two orders of TEs that, just as those from 
subclass I, do not form RNA intermediates. However, unlike “clas- 
sical” DNA transposons, they replicate without double-strand 
cleavage. Helitrons replicate using a rolling-circle mechanism, and 
their insertion does not result in the target site duplication 
[72]. They encode tyrosine recombinase along with some other 
proteins. Helitrons were first described in plants, but they are also 
present in other phyla, including fungi and mammals 
[73, 74]. Mavericks are large transposons that have been found in 
different eukaryotic lineages excluding plants [75]. They encode 
various numbers of proteins that include DNA polymerase B and an 
integrase. Kapitonov and Jurka suggested that their life cycle 
includes a single-strand excision, followed by extrachromosomal 
replication and reintegration to a new location [76]. 


4 Identification of Transposable Elements 


With the ever-growing number of sequenced genomes from different
branches of the tree of life, opportunities for TE research keep
increasing. There are several reasons to analyze TEs and the
“offspring” they leave in a genome. First of all,
they are very interesting biological subjects for studying genome
structure, gene regulation, or genome evolution. In some cases, they
also make genome assembly and annotation quite challenging,
especially with the current NGS technology that generates reads
shorter than TEs. TEs are therefore well worth studying. However,
this is not a simple task, and it requires different
approaches depending on the level of analysis. We will walk through
these different levels starting with raw genome sequences without
any annotation and discuss different methods and software used for
TE analyses. In principle, we can imagine two scenarios: in the first
one, genomic or transcriptome sequences come from a species
for which there is already some information about the transposon
repertoire, for instance, a related genome has been previously
characterized or TEs have been studied before. In the second
scenario, we have to deal with a completely unknown genome or
a genome for which little information exists with regard to TEs. In
the former case, one can apply a range of techniques used in
comparative genomics or try to search specific libraries of
transposons using the “homology search” approach. In the latter, which is
basically an approach to identify TEs de novo, first we need to find
any repeats in a genome and then attempt characterization and
classification of newly identified repetitive sequences. In this
approach, we will find any repeats, not necessarily transposons.
There are many algorithms, and even more software, that can be
applied in both approaches.


4.1 De Novo Approaches to Finding Repetitive Elements


There are several steps involved in the de novo characterization of 
transposons. First, we need to find all the repeats in a genome, then 
build a consensus of each family of related sequences, and finally 
classify detected sequences. For the first step, three groups of 
algorithms exist: the k-mer approach, sequence self-comparison, 
and periodicity analysis. 

In the k-mer approach, sequences are scanned for overrepre- 
sentation of strings of certain length. The idea is that repeats that 
belong to the same family are compositionally similar and share 
some oligomers. If the repeats occur many times in a genome, then 
those oligomers should be overrepresented. However, since repeats 
and transposons in particular are not perfect copies of a certain 
sequence, some mismatches must be allowed when oligo frequen- 
cies are calculated. The challenge is to determine optimal size of an 
oligo (k-mer) and number of mismatches allowed. Most likely, 
these parameters should be different for different types of
transposons, i.e., low versus high copy number, old versus young
transposons, and those from different classes and families. Several programs
have been developed based on the k-mer idea using a suffix tree data
structure, including REPuter [77, 78], Vmatch (Kurtz,
unpublished; http://www.vmatch.de/), and Repeat-match
[79, 80]. Another approach is to use fixed-length k-mers as seeds
and extend those seeds to define a repeat family, as
implemented in ReAS [81], RepeatScout [82], and Tallymer
[83]. Another interesting algorithm can be found in the FORRepeats
software [84], which uses the factor oracle data structure [85]. It
starts with detection of exact oligomers in the analyzed sequences,
followed by finding approximate repeats and their alignment.
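
The basic k-mer idea can be sketched in a few lines. The example below is deliberately simplified and is not how any of the programs named above are implemented: it counts exact k-mers on one strand only with a hash table, whereas real tools use suffix trees or seed extension and tolerate mismatches. The function name and the threshold parameter are hypothetical.

```python
from collections import Counter


def overrepresented_kmers(seq: str, k: int, min_count: int) -> dict:
    """Return every exact k-mer that occurs at least `min_count` times in `seq`."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# Toy sequence: three copies of GATTACA separated by unique spacers.
toy = "GATTACA" + "CCC" + "GATTACA" + "TGT" + "GATTACA"
frequent = overrepresented_kmers(toy, k=7, min_count=3)
```

Choosing k and the count threshold is the hard part in practice, for the reasons given above: old or low-copy families share few exact oligomers.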

The second group of programs developed for de novo detection
of repeated sequences uses a self-comparison approach.
Repeat Pattern Toolkit [86], RECON [87], PILER [88, 89], and
BLASTER [90] belong to this group. The idea is to use one of the
fast sequence similarity tools, e.g., BLAST [91], followed by
clustering of the search results. The programs differ in the search engine
used for the initial step (though most use one of the BLAST
algorithms), in the clustering method, and in the heuristics for merging
initial hits into a prototype element. For instance, RECON [87], which was
developed for repeat finding in unassembled sequence reads,
starts with an all-to-all comparison using the WU-BLAST engine.
Then, single-linkage clustering is applied to the alignment results,
followed by construction of an undirected graph based on
overlaps. The shortest sequence that contains connected images
(aligned subsequences) creates a prototype element. However, this
procedure might result in composite elements. To avoid this, all the
images are aligned to the prototype element to detect potential
illegitimate mergers and split those at every point with a significant
number of image ends.
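
The single-linkage clustering step mentioned above can be sketched with a union-find structure. This is an illustration of the general technique only, not RECON's code: the input is assumed to be a list of sequence-identifier pairs that already passed a similarity search, and all names are hypothetical.

```python
def single_linkage_clusters(hits):
    """Group sequence IDs so that any two IDs joined by a chain of hits share a cluster.

    `hits` is an iterable of (id_a, id_b) pairs, e.g., significant pairwise
    alignments from an all-to-all comparison.
    """
    parent = {}

    def find(x):
        # Follow parent links to the cluster representative, halving the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical hits: r1-r2 and r2-r3 chain into one family, r4-r5 form another.
families = single_linkage_clusters([("r1", "r2"), ("r2", "r3"), ("r4", "r5")])
```

Because a single hit is enough to join two clusters, unrelated elements that happen to be adjacent in some copies can be chained together, which is why a splitting step such as the one described above is needed afterwards.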

PILER [88, 89] uses a different approach to find initial
clusters. Instead of BLAST, it uses PALS (pairwise alignment of 
long sequences) for the initial alignment. PALS records only hit 
points and uses banded search of the defined maximum distance to 
optimize its performance. To further improve performance of the 
system, PILER uses different heuristics for different types of 
repeats, i.e., satellites, pseudosatellites, terminal repeats, and inter- 
spersed repeats. Finally, a consensus sequence is generated from a 
multiple sequence alignment of the defined family members. 

Dot matrix is a simple method to compare two biological 
sequences. The graphical output of such an analysis is called a dot- 
plot. Dotplots can be used to detect conserved domains, sequence 
rearrangements, RNA secondary structure, or repeated sequences. It 
compares every residue in one sequence to every residue in the other 
sequence or to every residue of the same sequence in the self- 
comparison mode. In the latter case, there will be a main diagonal 
line representing a perfect match and a number of short diagonal 
lines representing similar regions (red circles in Fig. 3). Interestingly, 
simple repeats appear as diamond shapes on a main diagonal line or 
short vertical and horizontal lines outside the main diagonal line (red 
squares in Fig. 3). The method was introduced to biological analyses 
almost a half century ago [92, 93]. However, the first easy-to-use 
software with a graphical interface, DOTTER, was developed much 
later [94]. The major problem of this approach is the time required 
for the dotplot calculation, which is of quadratic complexity. This 
proved to be prohibitive for comparison of the genome-size 
sequences. One of the solutions to this problem is using a word 
index for the fast identification of substrings. Gepard implements 
the suffix array data structure to improve the execution time [95]. It 
is written in Java, which makes it platform-independent. Gepard
enables analyses of sequences at the mega-base level in a matter
of seconds, and it takes about an hour to analyze the whole human
chromosome 1 [95]. An example of a dotplot produced by
Gepard is presented in Fig. 3.
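
The word-index shortcut mentioned above can be sketched as follows. Note the assumptions: this toy version uses a plain hash table of fixed-length words rather than the suffix array used by Gepard, it reports only exact word matches on the forward strand, and the function name and default word length are hypothetical.

```python
from collections import defaultdict


def word_dotplot(seq_a: str, seq_b: str, word: int = 4):
    """Return (i, j) pairs where the words starting at seq_a[i] and seq_b[j] are identical.

    Indexing the words of seq_b first means that only matching positions
    are visited, instead of comparing every residue with every residue.
    """
    index = defaultdict(list)
    for j in range(len(seq_b) - word + 1):
        index[seq_b[j:j + word]].append(j)
    dots = []
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], ()):
            dots.append((i, j))
    return dots


# Self-comparison of a toy sequence in which ACGT occurs at positions 0 and 6.
toy = "ACGTTTACGT"
dots = word_dotplot(toy, toy, word=4)
off_diagonal = [(i, j) for i, j in dots if i != j]
```

In self-comparison mode the pairs with i equal to j form the main diagonal, and the off-diagonal pairs mark the repeated word.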


With constant improvement of sequencing technology and the
associated decrease in sequencing cost, the number of newly sequenced
genomes is exploding. As of January 2019, there are more than
7000 eukaryotic and almost 180,000 prokaryotic genomes publicly
available (information retrieved on January 16, 2019, from
https://www.ncbi.nlm.nih.gov/genome/browse/). However, this comes
at a price; most of the recently sequenced genomes, due to the
short-read sequencing technology, are available at various levels of
“completeness” or assembly. For most non-model organisms, we
are presented with draft assemblies of rather short contigs.
Moreover, these genomes usually are not very well annotated, with TEs
not being on the annotation priority list. Unfortunately, genome
annotation pipelines do not include TE annotation, focusing on
protein-coding and RNA-coding genes. To fill the gap, a number of
methods have been developed to detect repeats from short reads.
Two algorithms dominate in attempts to determine repeats in NGS
raw reads: clustering and k-mer. Transposome [96] and
RepeatExplorer [97] employ the former approach, while RepARK [98],
REPdenovo [99], and dnaPipeTE [100] utilize the latter one.
Since NGS results in relatively short reads, assembly of selected
sequences into longer contigs representing TEs is required after
initial clustering of the raw reads.

Fig. 3 Graphical output of Gepard. A 30 kb fragment of mouse chromosome
12 was compared to itself. Similar sequences are represented by diagonal lines
if both fragments are located on the same strand or by reverse diagonal lines if
the fragments with significant similarity are located on opposite strands. Some
of the examples are marked with the red circles. Simple repeats are represented
by either diamond shapes on the main diagonal or horizontal and vertical lines.
Some of the examples are marked with the red squares


4.3 Population-Level Analyses of Transposable Elements


Recent advances in sequencing technology and the sharp decrease
in sequencing costs allow genomic studies at the population level.
Although initially focused on human populations [101-103],
population studies of other species have recently been initiated as
well [104, 105]. One of the common questions in such studies is
how much structural variation (SV) exists in different populations.
TE insertions are responsible for about 25% of structural variants in
human genomes [106]. In general, any tool designed for detection
of SV should work for TE insertion analysis, but specialized
software can take advantage of specific expectations related to
insertions of TEs. Most of the SV-detection algorithms rely on
paired-end reads and are based on discordant read pair mapping and/or
split-read mapping (Fig. 4). A discordant read pair is defined as one
that is inconsistent with the expected insert size in the library used
for sequencing. For example, if the insert size of the library used for
sequencing is 300 nt but the reads map to a reference genome
at a much larger distance or to two different chromosomes,
such a pair is considered to be discordant. If, additionally, one of
the reads maps to a TE, it might be an indication of a polymorphic
TE. Usually some filtering is used to reduce the chance of false
positives. The filters include a minimum number of reads in the cluster
mapped to a unique position, the quality score of the reads, or
consistency in read orientation. However, discordant read mapping
cannot detect the exact insertion position. Therefore another step is
required that may include local assembly and split-read mapping.

Fig. 4 Detection of a TE insertion (polymorphic TE) from the NGS data. The upper panel shows real genomic
sequence with a TE, which is not present in the reference genome (lower panel). Hypothetical discordant
read pairs (a, b, d, f, g, i, j, k, l, o, q, s, and t) have only one read of the pair mapped to the reference genome, while the
other would map to a consensus sequence of a TE. The hypothetical split reads (c, e, h, m, p, and r) will have
part of the sequence mapped to the reference genome and the other to a TE consensus sequence
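
The discordant-pair criterion described above can be written down as a small predicate. The sketch below is not taken from any of the tools discussed in this chapter: the record type, the field names, the 300 nt default (the library example used above), and the tolerance value are all assumptions made for illustration, and a real pipeline would read these values from aligner output.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical minimal record of where the two mates of a pair mapped."""
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    mate_hits_te: bool = False  # True if one mate aligns to a known TE sequence


def is_discordant(pair: ReadPair, expected_insert: int = 300, tolerance: int = 100) -> bool:
    """A pair is discordant if the mates map to different chromosomes or if
    their distance deviates from the expected insert size by more than
    `tolerance`."""
    if pair.chrom1 != pair.chrom2:
        return True
    distance = abs(pair.pos2 - pair.pos1)
    return abs(distance - expected_insert) > tolerance


def supports_te_insertion(pair: ReadPair, expected_insert: int = 300, tolerance: int = 100) -> bool:
    """Discordant pairs with one mate in a TE are candidate evidence for a polymorphic TE."""
    return pair.mate_hits_te and is_discordant(pair, expected_insert, tolerance)


# Hypothetical pairs: one concordant, one too far apart, one interchromosomal.
concordant = ReadPair("chr1", 1000, "chr1", 1310)
distant = ReadPair("chr1", 1000, "chr1", 9000)
candidate = ReadPair("chr1", 1000, "chr5", 1200, mate_hits_te=True)
```

The filters mentioned above (minimum cluster size, read quality, orientation consistency) would be applied to groups of such candidate pairs and are omitted here.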

A split read is defined as a read for which part of it maps
uniquely to one position in the genome and the other part to
another position. This is, for example, a very common feature of
the mapping of RNA-seq data to eukaryotic genomes when reads
span two exons. Split reads are also observed when structural
variants exist. In the case of a TE insertion, a part of the read will be
mapped to a unique location and the rest to a TE in some other
location or may not be mapped at all (Fig. 4).

Different methods for structural variant detection return different
results on the same data. Recently published benchmarks
demonstrate that TE detection is no exception
[107, 108]. Ewing [107] compared TranspoSeq [109] with two
other tools, Tea [110] and TraFIC [111], on the same data sets.
Results were not very encouraging, as in both comparisons there
was a high fraction of insertions detected only by a single program
[107]. A similar conclusion was drawn by Rishishwar et al. [108] in a
benchmark of a larger number of tools including MELT [106],
Mobster [112], and RetroSeq [113]. It is clear that different
programs have different biases, and each one can produce a high number
of false positives. It is therefore recommended to employ several
programs to obtain high-confidence results. Exhaustive tests run on
real and simulated human genome data showed superior performance of
MELT [106, 108]. TIPseqHunter is another tool developed to
identify transposon insertion sites based on transposon insertion
profiling using next-generation sequencing [114]. It employs
a machine learning algorithm to ensure high precision and reliability.
It is worth noting that all these tools were designed for short-read
sequencing methods. However, the current development of
single-molecule long-read sequencing technologies, such as PacBio
and Oxford Nanopore, may make these methods irrelevant and
obsolete. Long reads should offer superior performance and make
TE insertion detection relatively easy with more traditional
aligners, such as MegaBLAST [115], BLAT [116], or LAST [117].


4.4 Comparative Genomics of TE Insertions


To understand the general pattern of TE insertions in different
genomes and the evolutionary dynamics of TE families, a comparative
approach is necessary. Although precomputed alignments of different
genomes are publicly available (for example, the UCSC
Genome Browser includes Multiz alignments of 100 vertebrate
genomes [118]), not many tools are available for such analyses.
One of them is GPAC (genome presence/absence compiler), which
creates a table of presence and absence of selected elements based on
a precomputed multiple genome alignment [119]
(http://bioinformatics.uni-muenster.de/tools/gpac/index.hbi). The tool
is quite generic but is well suited for comparative TE analysis
(see Fig. 5 for an example).
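
The idea of a presence/absence table can be sketched as follows. This is not the GPAC implementation; it is a minimal illustration that assumes the fraction of each element covered by aligned sequence in each species has already been extracted from a multiple genome alignment. The function name, the `+`/`-`/`?` coding, and both thresholds are hypothetical.

```python
from typing import Dict, List


def presence_absence(coverage: Dict[str, Dict[str, float]],
                     species: List[str],
                     present_at: float = 0.8,
                     absent_at: float = 0.2) -> Dict[str, str]:
    """Turn per-species alignment coverage of each locus into a +/-/? pattern."""
    table = {}
    for locus, per_species in coverage.items():
        row = []
        for sp in species:
            frac = per_species.get(sp)
            if frac is None:
                row.append("?")      # species missing from the alignment block
            elif frac >= present_at:
                row.append("+")      # element largely aligned: scored as present
            elif frac <= absent_at:
                row.append("-")      # element largely unaligned: scored as absent
            else:
                row.append("?")      # ambiguous coverage
        table[locus] = "".join(row)
    return table
```

For example, an element with coverage 0.95 in human, 0.90 in chimpanzee, and 0.0 in mouse would be coded `++-` for the species order human, chimpanzee, mouse, which is the kind of pattern used to place an insertion event on a phylogeny.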
4.5 Classification of Transposable Elements


Once the consensus of a repetitive element has been constructed, it
can be subjected to further analyses. There are two major categories
of programs dealing with TE classification: library-based (or
similarity-based) and signature-based. The latter approach is very
often used in specialized software, i.e., software tailored to a
specific type of TE. However, some general tools also exist, e.g.,
TEclass [120].
The library approach is probably the most common approach
for TE classification. It is also very efficient and quite reliable as long
as good libraries of prototype sequences exist. In practice, it is the
recommended approach when analyzing sequences from
well-characterized genomes or from a genome relatively closely related
to a well-studied one. For instance, since the human genome is one
of the best studied, any primate sequences can be confidently
analyzed using the library approach. Most likely, the first software
using the similarity-based approach for repeat classification was
Censor, developed by Jerzy Jurka in the early 1990s [121]. It uses
RepBase [122] as a reference collection and BLAST as a search
engine [91]. However, the most popular TE detection software is
RepeatMasker (RM) (http://www.repeatmasker.org). Like Censor,
RM uses RepBase as a reference collection, with
AB-BLAST, RM-BLAST, or cross_match as a search engine. In
both cases, the original search hits are processed by a series of Perl
scripts to determine the structure of elements and assign them to
one of the known TE families. Both Censor and RM can also use
user-provided libraries, including “third-party” lineage-specific
libraries, e.g., TREP [123]. Over the years, RepeatMasker has become a
standard tool for TE analyses, and its output is often used for
more biologically oriented studies (see below). The aforementioned
programs have one important drawback: since they are
based entirely on sequence similarity, they can detect only TEs
that have been previously described. Nevertheless, as in many other
bioinformatics tasks, similarity searches should be the
first approach to the analysis of repetitive elements.
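
The logic of library-based classification, including its main drawback, can be sketched as follows. This is a deliberately simplified illustration and not the procedure used by Censor or RepeatMasker: the `Hit` record, the `classify_by_library` function, and the identity and coverage thresholds are all hypothetical.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    family: str       # family of the library sequence that was hit
    identity: float   # fraction of identical aligned positions (0-1)
    coverage: float   # fraction of the query covered by the alignment (0-1)


def classify_by_library(hits: List[Hit],
                        min_identity: float = 0.7,
                        min_coverage: float = 0.5) -> Optional[str]:
    """Assign a query to the family of its best sufficiently similar hit."""
    good = [h for h in hits
            if h.identity >= min_identity and h.coverage >= min_coverage]
    if not good:
        # No described relative in the library: the element stays unclassified.
        return None
    best = max(good, key=lambda h: h.identity * h.coverage)
    return best.family
```

The `None` branch is the drawback discussed above in code form: an element without a sufficiently similar library entry cannot be classified, no matter how well the search engine performs.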
Signature-based programs search for certain features
that characterize specific TEs, for example, long terminal repeats
(LTRs), target site duplications (TSDs), or primer-binding sites
(PBSs). Since different types (families) of elements are structurally
different, they require specific rules for their detection. Hence,
many of the programs that use signature-based algorithms are
specific to a certain type of transposon. A number of
programs specialize in the detection of LTR transposons and are
based on a similar methodology. They take into account several
structural features of LTR retroposons, including size, the distance
between paired LTRs and their similarity, the presence of TSDs,
and the presence of replication signals, i.e., the primer-binding site
and the polypurine tract (PPT). Some of the programs also check
for ORFs coding for the gag, pol, and env proteins. LTR_STRUC
[124] was one of the first programs based on this principle. It uses a
seed-and-extend strategy to find repeats located within a
user-defined distance. The candidate regions are extended based on
pairwise alignment to determine the boundaries of cognate LTRs.
Putative full-length elements are scored based on the presence of
TSD, PBS, PPT, and reverse transcriptase ORF. However, because
of the heuristics described above, LTR_STRUC is unable to find
incomplete LTR transposons and, in particular, solo LTRs. Another
limitation of this program is its Windows-only implementation, which
significantly hinders automated large-scale analysis. Several other
programs have been developed based on similar principles, e.g.,
LTR_par [125], find_LTR [126], LTR_FINDER [127], and
LTRharvest [128]. Lerat tested the performance of these programs
[129], and although the sensitivity of the methods was acceptable
(between 40% and 98%), it came at the expense of specificity, which
was very poor. In several cases, the number of falsely assigned
transposons exceeded the number of correctly detected ones.

MITEs and Helitrons are other groups of transposons with a
relatively conserved structure. Several specialized programs
have been developed that take advantage of their specific structure.
FINDMITE [130] and MUST [131] are tailored for MITEs,
while HelitronFinder [132] and HelSearch [133] were developed
for Helitron detection.

A further interesting approach to transposon classification was
implemented by Abrusan et al. [120] in the software package called
TEclass, which classifies unknown TE consensus sequences into
four categories, according to their mechanism of transposition:
DNA transposons, LTRs, LINEs, and SINEs. The classification
uses support vector machines, random forests, and learning vector
quantization, and it also predicts ORFs. Two complete sets of
classifiers are built using tetramers and pentamers, which are used in
two separate rounds of the classification. The software assumes that
the analyzed sequence represents a TE, and the classification process
is binary, with the following steps: forward versus reverse sequence
orientation > DNA versus retrotransposon > LTRs versus
nonLTRs (for retroelements) > LINEs versus SINEs (for nonLTR
repeats). If the different methods of classification lead to conflicting
results, TEclass reports the repeat either as unknown or as the last
category where the classification methods agree
(http://bioinformatics.uni-muenster.de/tools/teclass/index.hbi).

4.6 Pipelines

Recent years have witnessed some attempts to create more complex,
global analysis systems. One such system is REPCLASS
[134]. It consists of three classification modules: homology
(HOM), structure (STR), and target site duplication (TSD). Each
module can be run separately or in a pairwise manner, whereas
the final step of the analysis involves integration of the results 
delivered by each module. There is one interesting novelty in the
STR module, namely, the implementation of tRNAscan-SE [135] to
detect tRNA-like secondary structure within the query sequence,
one of the signatures of many SINE families. REPET is
another pipeline for TE sequence analyses. It uses a “classical”
three-step approach for de novo TE identification: self-alignment,
clustering, and consensus sequence generation. However, the
pipeline uses a spectrum of different methods at each step,
followed by a rigorous TE classification step based on the recently
proposed classification of TEs [136]. Unfortunately, a complex
implementation, which makes installing and running the system
rather difficult, limits the usage of the pipeline. The classification
step also seems to be unreliable, as it may annotate lineage-specific
TEs in the wrong taxonomic lineages (Kouzel and Makalowski,
unpublished data).

There are other attempts to create comprehensive systems for
“repeatome” analysis. One of them is dnaPipeTE, developed for
the analysis of mosquito genomes [100]. Interestingly, dnaPipeTE
works on raw NGS data, which makes the pipeline well suited
for genomes with lower sequencing depth. The raw reads are first
subjected to k-mer counting on sampled data. Sampling the
data to less than 0.25x of the genome size is required to avoid
clustering reads that represent unique sequences. The reads
determined to be repetitive are assembled into contigs using Trinity
[137]. Although Trinity was originally developed for transcriptome
assembly from RNA-seq data, it proves to be very useful for TE
assembly from short reads, as it can efficiently determine consensus
sequences of closely related transposons. In the next step,
dnaPipeTE annotates repeats using RepeatMasker with either built-in
or user-defined libraries. This is probably the weakest point of the
pipeline, as it will not annotate any novel TEs that have no similar
sequences in the provided libraries. It would be useful to
complement this step with model-based or machine learning
approaches (see Subheading 4.5). After contig annotation, the copy
number of the TEs is estimated using the BLAST algorithm
[91]. Finally, sequence identity between an individual TE and its
consensus sequence is used to determine the relative age of the TEs.
The pipeline produces a number of output files, including several
graphs, i.e., a pie chart with the relative proportions of the main
repeat classes, a graph with the number of base pairs aligned to
each TE contig, and the TE age distribution. Overall, dnaPipeTE is
very efficient, outperforming, according to the authors,
RepeatExplorer by severalfold [100].

4.7 Meta-analyses

Most of the software developed focuses on TE discovery
and rarely offers more biologically oriented analyses. Consequently,
researchers interested in TE biology or in using TE insertions as tools
for other biological investigations need to utilize other resources.
One of them is TinT (transposition in transposition), a tool that
applies a maximum likelihood model of TE insertion probability to
estimate the relative age of TE families [138]
(http://bioinformatics.uni-muenster.de/tools/tint/index.hbi). In the
first step, it takes RepeatMasker output to detect nested retroposons.
Then, it generates a data matrix that is used by a probabilistic model
to estimate the chronology and activity period of the analyzed
families. The method has been applied to resolve the evolutionary
history of galliformes [139], marsupials [140], lagomorphs [141],
squirrel monkey [142], and elephant shark [143].
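
The first two steps, detecting nested insertions and compiling them into a matrix, can be sketched as follows. This is not the TinT implementation: it assumes that an element interrupted by a younger insertion appears in the annotation as two same-family fragments directly flanking the inserted element, and it ignores consensus coordinates and strand, which a real analysis of RepeatMasker output would need to check. The function name and the `max_gap` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Annot = Tuple[str, int, int, str]  # (chromosome, start, end, TE family)


def nested_insertion_counts(annots: List[Annot],
                            max_gap: int = 10) -> Dict[Tuple[str, str], int]:
    """Count (inserted family, host family) pairs from flanking fragments."""
    by_chrom = defaultdict(list)
    for chrom, start, end, family in annots:
        by_chrom[chrom].append((start, end, family))
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for items in by_chrom.values():
        items.sort()
        # Slide over consecutive triplets: host fragment, insert, host fragment.
        for left, mid, right in zip(items, items[1:], items[2:]):
            same_host = left[2] == right[2] and left[2] != mid[2]
            adjacent = (mid[0] - left[1] <= max_gap
                        and right[0] - mid[1] <= max_gap)
            if same_host and adjacent:
                counts[(mid[2], left[2])] += 1
    return dict(counts)
```

Because an element can only insert into sequence that already exists, families that frequently appear as inserts but rarely as hosts are expected to be relatively young; the resulting matrix of counts is the kind of input on which a probabilistic model of activity periods can operate.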

Another interesting application that takes advantage of TEs is 
their use for detecting signatures of positive selection [144], a 
central goal in the field of evolutionary biology. A typical research 
scenario for this application would be investigating whether a spe- 
cific TE fragment exapted into resident genomic features, such as 
proximal and distal enhancers or exons of spliced transcripts, has 
undergone accelerated evolution that could be indicative of
gain-of-function events. In short, the test first requires the
identification of all genomically interspersed TE fragments that are
homologous to the TE segment of interest, which can be done through
alignments with a family consensus sequence. Based on multi-species
genome alignments, a second step involves identification of
lineage-specific substitutions in every single homologous fragment,
which are then
consolidated into a distribution of lineage-specific substitutions 
that provides the expectation (null distribution) for a segment 
evolving largely without specific constraints (neutrally). A signifi- 
cantly higher number of lineage-specific substitutions observed in 
the TE fragment of interest compared to the null distribution could 
then be interpreted as a molecular signature of adaptive evolution. 
However, the possibility of confounding molecular mechanisms, 
such as GC-biased gene conversion [145-147], needs to be eval- 
uated. We note that building the null distribution based only on 
data from intergenic regions, where transcription-coupled repair is 
absent, results in a more liberal estimate of the expected
substitutions, which in turn leads to a more conservative estimate of
adaptive evolution. Additionally, building the null distribution
requires the detection of many homologous fragments, which limits
the applicability of the test to TE families with numerous members
in a given genome. Prime examples would be human Alu or murine
B1 SINEs. In theory, this test could also be used for detecting
signatures of purifying selection by searching for fragments
depleted of lineage-specific substitutions. However, a low level
or complete lack of lineage-specific substitutions is characteristic of
many TE fragments, obscuring the effect of potential purifying
forces. 
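
The comparison of the fragment of interest with the null distribution can be sketched as a one-sided empirical p-value. This is a minimal illustration, not the test from the cited work: it assumes that the homologous fragments have comparable lengths, so that raw substitution counts can be compared directly, and the function name and the add-one correction are choices made here for illustration.

```python
from typing import Sequence


def empirical_p_value(observed: int, null_counts: Sequence[int]) -> float:
    """Share of null fragments with at least as many substitutions as observed."""
    if not null_counts:
        raise ValueError("null distribution is empty")
    as_extreme = sum(1 for count in null_counts if count >= observed)
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (as_extreme + 1) / (len(null_counts) + 1)
```

The smallest attainable value is 1/(n + 1) for n homologous fragments, which makes explicit why the test is restricted to families with numerous members: with only a handful of fragments, no observation can reach a conventional significance threshold.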
5 Concluding Remarks


Annoying junk for some, hidden treasure for others, TEs can hardly
be ignored [148]. With their diversity and high copy number in
most genomes, they are not the easiest biological entities to
analyze. Nevertheless, recent years have witnessed increased interest
in TEs. On the one hand, we observe improvement in computational
tools specialized in TE analyses. Table 3 lists some of these tools,
and an up-to-date list can be found at our web site:
http://www.bioinformatics.uni-muenster.de/ScrapYard/. On the other
hand, improved tools and new technologies enable biologists to explore
new research avenues that might lead to novel, fascinating insights
into the biology of mobile elements.

Table 3
Selected resources for transposable elements discovery and analyses

Software                Address
AB-BLAST                http://www.advbiocomp.com/blast.html
ACLAME                  http://aclame.ulb.ac.be/
BLASTER suite           http://urgi.versailles.inra.fr/index.php/urgi/Tools/BLASTER
Censor                  http://www.girinst.org/censor/download.php
DOTTER                  http://sonnhammer.sbc.su.se/Dotter.html
DROPOSON                ftp://biom3.univ-lyon1.fr//pub/drosoposon/
find_ltr                http://darwin.informatics.indiana.edu/cgi-bin/evolution/ltr.pl
FINDMITE                http://jaketu.biochem.vt.edu/dl_software.htm
FORRepeats              http://al.jalix.org/FORRepeats/
Gepard                  http://cube.univie.ac.at/gepard
HelitronFinder          http://limei.montclair.edu/HT.html
HelSearch               http://sourceforge.net/project/showfiles.php?group_id=260708
HERVd                   http://herv.img.cas.cz/
IRF                     http://tandem.bu.edu/irf/irf.download.html
LTR_FINDER              http://tlife.fudan.edu.cn/ltr_finder/
LTR_MINER               http://genomebiology.com/2004/5/10/R79/suppl/s7
LTR_par                 http://www.eecs.wsu.edu/~ananth/software.htm
MGEScan-LTR             http://darwin.informatics.indiana.edu/cgi-bin/evolution/daphnia_ltr.pl
MGEScan-nonLTR          http://darwin.informatics.indiana.edu/cgi-bin/evolution/nonltr/nonltr.pl
microTranspoGene        http://transpogene.tau.ac.il/microTranspoGene.html
MITE-Hunter             http://target.iplantcollaborative.org/mite_hunter.html
PILER                   http://www.drive5.com/piler/
REannotate              http://www.bioinformatics.org/reannotate/index.html
ReAS                    ftp://ftp.genomics.org.cn/pub/ReAS/software/
RECON                   http://eddylab.org/software/recon/
RepSeek                 http://wwwabi.snv.jussieu.fr/public/RepSeeck/
RepeatFinder            http://cbcb.umd.edu/software/RepeatFinder/
RepeatMasker            http://www.repeatmasker.org/
RepeatModeler           http://www.repeatmasker.org/RepeatModeler/
RepeatRunner            http://www.yandell-lab.org/software/repeatrunner.html
Repeat-match            http://mummer.sourceforge.net/
REPET                   http://urgi.versailles.inra.fr/index.php/urgi/Tools/REPET
RepMiner                http://repminer.sourceforge.net/index.htm
REPuter                 http://bibiserv.techfak.uni-bielefeld.de/reputer/
RetroMap                http://www.burchsite.com/bioi/RetroMapHome.html
SMaRTFinder             http://services.appliedgenomics.org/software/smartfinder/
SoyTEdb                 http://www.soytedb.org
Spectral Repeat Finder  http://www.imtech.res.in/raghava/srf/
T-lex                   http://petrov.stanford.edu/cgi-bin/Tlex.html
Tallymer                http://www.zbh.uni-hamburg.de/Tallymer/
TARGeT                  http://target.iplantcollaborative.org/
TEclass                 http://www.bioinformatics.uni-muenster.de/tools/teclass/
TE Displayer            http://labs.csb.utoronto.ca/yang/TE_Displayer/
TE nest                 http://www.plantgdb.org/prj/TE_nest/TE_nest.html
TESD                    http://pbil.univ-lyon1.fr/software/TESD/
TinT                    http://www.bioinformatics.uni-muenster.de/tools/tint/
TIPseqHunter            https://github.com/fenyolab/TIPseqHunter
TRANSPO                 http://alggen.lsi.upc.es/recerca/search/transpo/transpo.html
TranspoGene             http://transpogene.tau.ac.il/
Transposon-PSI          http://transposonpsi.sourceforge.net/
TRAP                    http://www.coccidia.icb.usp.br/trap/tutorials/
TRF                     http://tandem.bu.edu/trf/trf.html
TROLL                   http://finder.sourceforge.net/
TSDfinder               http://www.ncbi.nlm.nih.gov/CBBresearch/Landsman/TSDfinder/
WikiPoson               http://www.bioinformatics.org/wikiposon/doku.php
VariationHunter         http://compbio.cs.sfu.ca/software-variation-hunter
Vmatch                  http://www.vmatch.de/


References 


1. Waring M, Britten RJ (1966) Nucleotide sequence repetition - a
    rapidly reassociating fraction of mouse DNA. Science
    154(3750):791-794
2. Britten RJ, Kohne DE (1968) Repeated sequences in DNA. Hundreds of
    thousands of copies of DNA sequences have been incorporated into
    the genomes of higher organisms. Science 161(841):529-540
3. Makalowski W (2001) The human genome structure and organization.
    Acta Biochim Pol 48(3):587-598
4. C. elegans Sequencing Consortium (1998) Genome sequence of the
    nematode C. elegans: a platform for investigating biology. Science
    282(5396):2012-2018
5. SanMiguel P, Tikhonov A, Jin YK, Motchoulskaia N, Zakharov D,
    Melake-Berhan A, Springer PS, Edwards KJ, Lee M, Avramova Z,
    Bennetzen JL (1996) Nested retrotransposons in the intergenic
    regions of the maize genome. Science 274(5288):765-768
6. Keller EF (1983) A feeling for the organism: the life and work of
    Barbara McClintock. W.H. Freeman, San Francisco
7. McClintock B (1950) The origin and behavior of mutable loci in
    maize. Proc Natl Acad Sci U S A 36(6):344-355
8. McClintock B (1951) Chromosome organization and genic expression.
    Cold Spring Harb Symp Quant Biol 16:13-47
9. McClintock B (1956) Controlling elements and the gene. Cold Spring
    Harb Symp Quant Biol 21:197-216
10. Malamy MH, Fiandt M, Szybalski W (1972) Electron microscopy of
    polar insertions in the lac operon of Escherichia coli. Mol Gen
    Genet 119(3):207-222
11. Ohno S (1972) So much “junk” DNA in our genome. In: Smith HH (ed)
    Brookhaven symposia in biology, vol 23. Gordon & Breach, New York,
    pp 366-370
12. Aronson AI, Bolton ET, Britten RJ, Cowie DB, Duerksen JD,
    McCarthy BJ, McQuillen K, Roberts RB (1960) Biophysics. In:
    Yearbook 59, vol 59. Carnegie Institution of Washington,
    Washington, pp 229-279
13. Ehret CF, De Haller G (1963) Origin, development and maturation
    of organelles and organelle systems of the cell surface in
    Paramecium. J Ultrastruct Res 23(Suppl 6):1-42
14. Brosius J (1991) Retroposons--seeds of evolution. Science
    251(4995):753
15. Makalowski W, Mitchell GA, Labuda D (1994) Alu sequences in the
    coding regions of mRNA: a source of protein variability. Trends
    Genet 10(6):188-193
16. Jordan IK, Rogozin IB, Glazko GV, Koonin EV (2003) Origin of a
    substantial fraction of human regulatory sequences from
    transposable elements. Trends Genet 19(2):68-72.
    S0168-9525(02)00006-9 [pii]
17. Thornburg BG, Gotea V, Makalowski W (2006) Transposable elements
    as a significant source of transcription regulating signals. Gene
    365:104-110. https://doi.org/10.1016/j.gene.2005.09.036.
    S0378-1119(05)00653-0 [pii]
18. Feschotte C (2008) Transposable elements and the evolution of
    regulatory networks. Nat Rev Genet 9(5):397-405.
    https://doi.org/10.1038/nrg2337
19. Mita P, Boeke JD (2016) How retrotransposons shape genome
    regulation. Curr Opin Genet Dev 37:90-100.
    https://doi.org/10.1016/j.gde.2016.01.001
20. Chuong EB, Elde NC, Feschotte C (2017) Regulatory activities of
    transposable elements: from conflicts to benefits. Nat Rev Genet
    18(2):71-86. https://doi.org/10.1038/nrg.2016.139
21. Franke V, Ganesh S, Karlic R, Malik R, Pasulka J, Horvat F,
    Kuzman M, Fulka H, Cernohorska M, Urbanova J, Svobodova E, Ma J,
    Suzuki Y, Aoki F, Schultz RM et al (2017) Long terminal repeats
    power evolution of genes and gene expression programs in mammalian
    oocytes and zygotes. Genome Res 27(8):1384-1394.
    https://doi.org/10.1101/gr.216150.116
22. Wang L, Rishishwar L, Marino-Ramirez L, Jordan IK (2017) Human
    population-specific gene expression and transcriptional network
    modification with polymorphic transposable elements. Nucleic Acids
    Res 45(5):2318-2328. https://doi.org/10.1093/nar/gkw1286
23. Venuto D, Bourque G (2018) Identifying co-opted transposable
    elements using comparative epigenomics. Develop Growth Differ
    60(1):53-62. https://doi.org/10.1111/dgd.12423
24. Mahillon J, Chandler M (1998) Insertion sequences. Microbiol Mol
    Biol Rev 62(3):725-774
25. Wilde C, Escartin F, Kokeguchi S, Latour-Lambert P, Lectard A,
    Clement JM (2003) Transposases are responsible for the target
    specificity of IS1397 and ISKpn1 for two different types of
    palindromic units (PUs). Nucleic Acids Res 31(15):4345-4353
26. Derbyshire KM, Grindley NDF (1996) Cis preference of the IS903
    transposase is mediated by a combination of transposase
    instability and inefficient translation. Mol Microbiol
    21(6):1261-1272. https://doi.org/10.1111/j.1365-2958.1996.tb02587.x
27. Ichikawa H, Ikeda K, Amemura J, Ohtsubo E (1990) Two domains in
    the terminal inverted-repeat sequence of transposon Tn3. Gene
    86(1):11-17
28. Maekawa T, Amemura-Maekawa J, Ohtsubo E (1993) DNA binding
    domains in Tn3 transposase. Mol Gen Genet 236(2-3):267-274
29. Weinert TA, Schaus NA, Grindley ND (1983) Insertion sequence
    duplication in transpositional recombination. Science
    222(4625):755-765
30. Turlan C, Chandler M (1995) IS1-mediated intramolecular
    rearrangements: formation of excised transposon circles and
    replicative deletions. EMBO J 14(21):5410-5421
31. Siguier P, Gourbeyre E, Varani A, Ton-Hoang B, Chandler M (2015)
    Everyman’s guide to bacterial insertion sequences. Microbiol
    Spectr 3(2):MDNA3-0030-2014.
    https://doi.org/10.1128/microbiolspec.MDNA3-0030-2014
32. Finnegan DJ (1989) Eukaryotic transposable elements and genome
    evolution. Trends Genet 5(4):103-107
33. Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B,
    Flavell A, Leroy P, Morgante M, Panaud O, Paux E, SanMiguel P,
    Schulman AH (2007) A unified classification system for eukaryotic
    transposable elements. Nat Rev Genet 8(12):973-982.
    https://doi.org/10.1038/nrg2165. nrg2165 [pii]
34. Hughes SH (2015) Reverse transcription of retroviruses and LTR
    retrotransposons. Microbiol Spectr 3(2):MDNA3-0027-2014.
    https://doi.org/10.1128/microbiolspec.MDNA3-0027-2014
35. Kazazian HH Jr (2004) Mobile elements: drivers of genome
    evolution. Science 303(5664):1626-1632.
    https://doi.org/10.1126/science.1089670. 303/5664/1626 [pii]
36. Malik HS, Henikoff S, Eickbush TH (2000) Poised for contagion:
    evolutionary origins of the infectious abilities of invertebrate
    retroviruses. Genome Res 10(9):1307-1318
37. Leib-Mosch C, Haltmeier M, Werner T, Geigl EM, Brack-Werner R,
    Francke U, Erfle V, Hehlmann R (1993) Genomic distribution and
    transcription of solitary HERV-K LTRs. Genomics 18(2):261-269.
    https://doi.org/10.1006/geno.1993.1464
38. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J,
    Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K,
    Heaford A, Howland J et al (2001) Initial sequencing and analysis
    of the human genome. Nature 409(6822):860-921.
    https://doi.org/10.1038/35057062
39. Wicker T, Stein N, Albar L, Feuillet C, Schlagenhauf E, Keller B
    (2001) Analysis of a contiguous 211 kb sequence in diploid wheat
    (Triticum monococcum L.) reveals multiple mechanisms of genome
    evolution. Plant J 26(3):307-316. tpj1028 [pii]
40. Vicient CM, Kalendar R, Anamthawat-Jonsson K, Schulman AH (1999)
    Structure, functionality, and evolution of the BARE-1
    retrotransposon of barley. Genetica 107(1-3):53-63
41. SanMiguel P, Gaut BS, Tikhonov A, Nakajima Y, Bennetzen JL (1998)
    The paleontology of intergene retrotransposons of maize. Nat Genet
    20(1):43-45. https://doi.org/10.1038/1695
42. Peterson DG, Schulze SR, Sciara EB, Lee SA, Bowers JE, Nagel A,
    Jiang N, Tibbitts DC, Wessler SR, Paterson AH (2002) Integration
    of Cot analysis, DNA cloning, and high-throughput sequencing
    facilitates genome characterization and gene discovery. Genome Res
    12(5):795-807. https://doi.org/10.1101/gr.226102. Article
    published online before print in April 2002


43. Zuker C, Lodish HF (1981) Repetitive DNA sequences cotranscribed
    with developmentally regulated Dictyostelium discoideum mRNAs.
    Proc Natl Acad Sci U S A 78(9):5386-5390
44. Goodwin TJ, Poulter RT (2001) The DIRS1 group of
    retrotransposons. Mol Biol Evol 18(11):2067-2082
45. Piednoel M, Bonnivard E (2009) DIRS1-like retrotransposons are
    widely distributed among Decapoda and are particularly present in
    hydrothermal vent organisms. BMC Evol Biol 9:86.
    https://doi.org/10.1186/1471-2148-9-86
46. Evgen’ev MB, Arkhipova IR (2005) Penelope-like elements - a new
    class of retroelements: distribution, function and possible
    evolutionary significance. Cytogenet Genome Res 110(1-4):510-521.
    https://doi.org/10.1159/000084984
47. Arkhipova IR (2006) Distribution and phylogeny of Penelope-like
    elements in eukaryotes. Syst Biol 55(6):875-885.
    https://doi.org/10.1080/10635150601077683
48. Singer MF (1982) Highly repeated sequences in mammalian genomes.
    Int Rev Cytol 76:67-112
49. Singer MF (1982) SINEs and LINEs: highly repeated short and long
    interspersed sequences in mammalian genomes. Cell 28(3):433-434
50. Mills RE, Bennett EA, Iskow RC, Devine SE (2007) Which
    transposable elements are active in the human genome? Trends Genet
    23(4):183-191. https://doi.org/10.1016/j.tig.2007.02.006
51. Biedler J, Tu Z (2003) Non-LTR retrotransposons in the African
    malaria mosquito, Anopheles gambiae: unprecedented diversity and
    evidence of recent activity. Mol Biol Evol 20(11):1811-1825.
    https://doi.org/10.1093/molbev/msg189. msg189 [pii]
52. Martin SL, Cruceanu M, Branciforte D, Wai-Lun Li P, Kwok SC,
    Hodges RS, Williams MC (2005) LINE-1 retrotransposition requires
    the nucleic acid chaperone activity of the ORF1 protein. J Mol
    Biol 348(3):549-561. https://doi.org/10.1016/j.jmb.2005.03.003
53. Martin SL (2010) Nucleic acid chaperone properties of ORF1p from
    the non-LTR retrotransposon, LINE-1. RNA Biol 7(6):706-711
54. Kapitonov VV, Jurka J (2003) Molecular paleontology of
    transposable elements in the Drosophila melanogaster genome. Proc
    Natl Acad Sci U S A 100(11):6569-6574.
    https://doi.org/10.1073/pnas.0732024100
55. Kajikawa M, Okada N (2002) LINEs mobilize SINEs in the eel
    through a shared 3′ sequence. Cell 111(3):433-444.
    S0092867402010413 [pii]
56. Houck CM, Rinehart FP, Schmid CW (1979) A ubiquitous family of
    repeated DNA sequences in the human genome. J Mol Biol
    132(3):289-306
57. Jurka J, Zietkiewicz E, Labuda D (1995) Ubiquitous mammalian-wide
    interspersed repeats (MIRs) are molecular fossils from the
    mesozoic era. Nucleic Acids Res 23(1):170-175
58. Wang H, Xing J, Grover D, Hedges DJ, Han KD, Walker JA, Batzer MA
    (2005) SVA elements: a hominid-specific retroposon family. J Mol
    Biol 354(4):994-1007. https://doi.org/10.1016/j.jmb.2005.09.085
59. Ostertag EM, Goodier JL, Zhang Y, Kazazian HH (2003) SVA elements
    are nonautonomous retrotransposons that cause disease in humans.
    Am J Hum Genet 73(6):1444-1451. https://doi.org/10.1086/380207
60. Vanin EF (1985) Processed pseudogenes: characteristics and
    evolution. Annu Rev Genet 19:253-272
61. Maestre J, Tchenio T, Dhellin O, Heidmann T (1995) mRNA
    retroposition in human cells: processed pseudogene formation.
    EMBO J 14:6333-6338
62. Kabza M, Ciomborowska J, Makalowska I (2014) RetrogeneDB--a
    database of animal retrogenes. Mol Biol Evol 31(7):1646-1648.
    https://doi.org/10.1093/molbev/msu139
63. Zhang Z, Harrison P, Gerstein M (2002) Identification and
    analysis of over 2000 ribosomal protein pseudogenes in the human
    genome. Genome Res 12:1466-1482
64. Torrents D, Suyama M, Zdobnov E, Bork P (2003) A genome-wide
    survey of human pseudogenes. Genome Res 13:2559-2567
65. Szcześniak MW, Ciomborowska J, Nowak W, Rogozin IB, Makałowska I
    (2011) Primate and rodent specific intron gains and the origin of
    retrogenes with splice variants. Mol Biol Evol 28:33-38


66. Goodwin TJ, Butler MI, Poulter RT (2003) Cryptons: a group of
    tyrosine-recombinase-encoding DNA transposons from pathogenic
    fungi. Microbiology 149(Pt 11):3099-3109
67. Kojima KK, Jurka J (2011) Crypton transposons: identification of
    new diverse families and ancient domestication events. Mob DNA
    2(1):12. https://doi.org/10.1186/1759-8753-2-12
68. Bureau TE, Wessler SR (1994) Stowaway: a new family of inverted
    repeat elements associated with the genes of both
    monocotyledonous and dicotyledonous plants. Plant Cell
    6(6):907-916. https://doi.org/10.1105/tpc.6.6.907. 6/6/907 [pii]
69. Feschotte C, Swamy L, Wessler SR (2003) Genome-wide analysis of
    mariner-like transposable elements in rice reveals complex
    relationships with stowaway miniature inverted repeat
    transposable elements (MITEs). Genetics 163(2):747-758
70. Zhou MB, Tao GY, Pi PY, Zhu YH, Bai YH, Meng XW (2016)
    Genome-wide characterization and evolution analysis of miniature
    inverted-repeat transposable elements (MITEs) in moso bamboo
    (Phyllostachys heterocycla). Planta 244(4):775-787.
    https://doi.org/10.1007/s00425-016-2544-0
71. Wicker T, Robertson JS, Schulze SR, Feltus FA, Magrini V,
    Morrison JA, Mardis ER, Wilson RK, Peterson DG, Paterson AH,
    Ivarie R (2005) The repetitive landscape of the chicken genome.
    Genome Res 15(1):126-136. https://doi.org/10.1101/gr.2438004
72. Kapitonov VV, Jurka J (2001) Rolling-circle transposons in
    eukaryotes. Proc Natl Acad Sci U S A 98:8714-8719
73. Hood ME (2005) Repetitive DNA in the automictic fungus
    Microbotryum violaceum. Genetica 124(1):1-10
74. Pritham EJ, Feschotte C (2007) Massive amplification of
    rolling-circle transposons in the lineage of the bat Myotis
    lucifugus. Proc Natl Acad Sci U S A 104:1895-1900
75. Pritham EJ, Putliwala T, Feschotte C (2007) Mavericks, a novel
    class of giant transposable elements widespread in eukaryotes and
    related to DNA viruses. Gene 390(1-2):3-17.
    https://doi.org/10.1016/j.gene.2006.08.008.
    S0378-1119(06)00537-3 [pii]
76. Kapitonov VV, Jurka J (2006) Self-synthesizing DNA transposons in
    eukaryotes. Proc Natl Acad Sci U S A 103:4540-4545
77. Kurtz S, Schleiermacher C (1999) REPuter: fast computation of
    maximal repeats in complete genomes. Bioinformatics 15(5):426-427
78. Kurtz S, Choudhuri JV, Ohlebusch E, Schleiermacher C, Stoye J,
    Giegerich R (2001) REPuter: the manifold applications of repeat
    analysis on a genomic scale. Nucleic Acids Res 29(22):4633-4642
79. Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O,
    Salzberg SL (1999) Alignment of whole genomes. Nucleic Acids Res
    27(11):2369-2376
80. Delcher AL, Phillippy A, Carlton J, Salzberg SL (2002) Fast
    algorithms for large-scale genome alignment and comparison.
    Nucleic Acids Res 30(11):2478-2483
81. Li RQ, Ye J, Li SG, Wang J, Han YJ, Ye C, Wang J, Yang HM, Yu J,
    Wong GKS, Wang J (2005) ReAS: recovery of ancestral sequences for
    transposable elements from the unassembled reads of a whole
    genome shotgun. PLoS Comput Biol 1(4):313-321.
    https://doi.org/10.1371/Journal.Pcbi.0010043. Artn E43 [pii]
82. Price AL, Jones NC, Pevzner PA (2005) De novo identification of
    repeat families in large genomes. Bioinformatics 21:1351-1358.
    https://doi.org/10.1093/Bioinformatics/Bti1018
83. Kurtz S, Narechania A, Stein JC, Ware D (2008) A new method to
    compute K-mer frequencies and its application to annotate large
    repetitive plant genomes. BMC Genomics 9:517.
    https://doi.org/10.1186/1471-2164-9-517
84. Lefebvre A, Lecroq T, Dauchel H, Alexandre J (2003) FORRepeats:
    detects repeats on entire chromosomes and between genomes.
    Bioinformatics 19(3):319-326.
    https://doi.org/10.1093/Bioinformatics/Btf843
85. Crochemore M, Ilie L, Seid-Hilmi E (2006) Factor oracles. In:
    Ibarra OH, Yen H-C (eds) Implementation and application of
    automata. Springer, Berlin, pp 78-89
86. Agrawal P, States D (1994) The Repeat Pattern Toolkit (RPT):
    analyzing the structure and evolution of the C. elegans genome.
    Proc Int Conf Intell Syst Mol Biol 2:9
87. Bao ZR, Eddy SR (2002) Automated de novo identification of repeat
    sequence families in sequenced genomes. Genome Res
    12(8):1269-1276. https://doi.org/10.1101/
Gr.88502 

Edgar RC, Myers EW (2005) PILER: identi- 
fication and classification of genomic repeats. 
Bioinformatics 21:1152-I158. https://doi. 
org/10.1093/Bioinformatics/Btil003 


204 


89 


90. 


91. 


92. 


93. 


94. 


95. 


96. 


97. 


98. 


99. 


100. 


Wojciech Makatowski et al. 


. Edgar RC (2007) PILER-CR: fast and accu- 
rate identification of CRISPR repeats. BMC 
Bioinform 8:18. https://doi.org/10.1186/ 
1471-2105-8-18 

Quesneville H, Bergman CM, Andrieu O, 
Autard D, Nouaud D, Ashburner M, Anxola- 
behere D (2005) Combined evidence annota- 
tion of transposable elements in genome 
sequences. PLoS Comput Biol 1 
(2):166-175. Artn E22. https://doi.org/10. 
1371/Journal.Pcbi.0010022 


Altschul SF, Gish W, Miller W, Myers EW, 
Lipman DJ (1990) Basic local alignment 
search tool. J Mol Biol 215(3):403-410 


Fitch WM (1969) Locating gaps in amino 
acid sequences to optimize the homology 
between two proteins. Biochem Genet 3 
(2):99-108 

Gibbs AJ, McIntyre GA (1970) The diagram, 
a method for comparing sequences. Its use 
with amino acid and nucleotide sequences. 
Eur J Biochem 16(1):1-11 

Sonnhammer EL, Durbin R (1995) A 
dot-matrix program with dynamic threshold 
control suited for genomic DNA and protein 
sequence analysis. Gene 167(1-2):GC1-G10 


Krumsiek J, Arnold R, Rattei T (2007) 
Gepard: a rapid and sensitive tool for creating 
dotplots on genome scale. Bioinformatics 23 
(8):1026-1028. https://doi.org/10.1093/ 
bioinformatics /btm039 


Staton SE, Burke JM (2015) Transposome: a 
toolkit for annotation of transposable element 
families from unassembled sequence reads. 
Bioinformatics 31(11):1827-1829. https:// 
doi.org/10.1093 /bioinformatics/btv059 
Novak P, Neumann P, Pech J, Steinhaisl J, 
Macas J (2013) RepeatExplorer: a Galaxy- 
based web server for genome-wide characteri- 
zation of eukaryotic repetitive elements from 
next-generation sequence reads. Bioinformat- 
ics 29(6):792-793. https://doi.org/10. 
1093 /bioinformatics/btt054 

Koch P, Platzer M, Downie BR (2014) 
RepARK--de novo creation of repeat libraries 
from whole-genome NGS reads. Nucleic 
Acids Res 42(9):e80. https://doi.org/10. 
1093 /nar/gku210 

Chu C, Nielsen R, Wu Y (2016) REPdenovo: 
inferring de novo repeat motifs from short 
sequence reads. PLoS One 11(3):e0150719. 
https: //doi.org/10.1371/journal.pone. 
0150719 

Goubert C, Modolo L, Vieira C, 
ValienteMoro C, Mavingui P, Boulesteix M 
(2015) De novo assembly and annotation of 
the Asian tiger mosquito (Aedes albopictus) 


101. 


102. 


103. 


104. 


105. 


106. 


107. 


108. 


109. 


repeatome with dnaPipeTE from raw geno- 
mic reads and comparative analysis with the 
yellow fever mosquito (Aedes aegypti). 
Genome Biol Evol 7(4):1192-1205. 
https://doi.org/10.1093/gbe/evv050 


Genome of the Netherlands Consortium 
(2014) Whole-genome sequence variation, 
population structure and demographic his- 
tory of the Dutch population. Nat Genet 46 
(8):818-825. https://doi.org/10.1038/ng. 
3021 


1000 Genomes Project Consortium, 
Auton A, Brooks LD, Durbin RM, Garrison 
EP, Kang HM, Korbel JO, Marchini JL, 
McCarthy S, McVean GA, Abecasis GR 
(2015) A global reference for human genetic 
variation. Nature 526(7571):68-74. https:// 
doi.org/10.1038 /nature15393 

UK10K Consortium, Walter K, Min JL, 
Huang J, Crooks L, Memari Y, McCarthy S, 
Perry JR, Xu C, Futema M, Lawson D, 
Iotchkova V, Schiffels S, Hendricks AE, 
Danecek P et al (2015) The UKI1OK project 
identifies rare variants in health and disease. 
Nature 526(7571):82-90. https://doi.org/ 
10.1038 /nature14962 

Lack JB, Lange JD, Tang AD, Corbett-Detig 
RB, Pool JE (2016) A thousand fly genomes: 
an expanded Drosophila genome nexus. Mol 
Biol Evol 33(12):3308-3313. https://doi. 
org/10.1093/molbev/msw195 

Lynch M, Gutenkunst R, Ackerman M, 
Spitze K, Ye Z, Maruki T, Jia Z (2017) Popu- 
lation genomics of Daphnia pulex. Genetics 
206(1):315-332. https://doi.org/10.1534/ 
genetics.116.190611 

Gardner EJ, Lam VK, Harris DN, Chuang 
NT, Scott EC, Pittard WS, Mills RE, Gen- 
omes Project Consortium, Devine SE 
(2017) The Mobile Element Locator Tool 
(MELT): population-scale mobile element 
discovery and biology. Genome Res 27 
(11):1916-1929. https://doi.org/10.1101/ 
gr.218032.116 

Ewing AD (2015) Transposable element 
detection from whole genome sequence 
data. Mob DNA 6:24. https://doi.org/10. 
1186/s13100-015-0055-3 

Rishishwar L, Marino-Ramirez L, Jordan IK 
(2016) Benchmarking computational tools 
for polymorphic transposable element detec- 
tion. Brief Bioinform. https://doi.org/10. 
1093 /bib/bbw072 

Helman E, Lawrence MS, Stewart C, 
Sougnez C, Getz G, Meyerson M (2014) 
Somatic retrotransposition in human cancer 
revealed by whole-genome and exome 


110. 


lll. 


112. 


113. 


114. 


115. 


116. 


117. 


118. 


Transposable Elements: Classification, Identification, and Their Use... 


sequencing. Genome Res 24(7):1053-1063. 
https: //doi.org/10.1101/gr.163659.113 
Lee E, Iskow R, Yang L, Gokcumen O, 
Haseley P, Luquette LJ III, Lohr JG, Harris 
CC, Ding L, Wilson RK, Wheeler DA, Gibbs 
RA, Kucherlapati R, Lee C, Kharchenko PV 
et al (2012) Landscape of somatic retrotran- 
sposition in human cancers. Science 337 
(6097):967-971. https: //doi.org/10.1126/ 
science.1222077 

Tubio JMC, Li Y, Ju YS, Martincorena I, 
Cooke SL, Tojo M, Gundem G, Pipinikas 
CP, Zamora J, Raine K, Menzies A, Roman- 
Garcia P, Fullam A, Gerstung M, Shlien A et al 
(2014) Mobile DNA in cancer. Extensive 
transduction of nonrepetitive DNA mediated 
by Ll retrotransposition in cancer genomes. 
Science 345(6196):1251343. https: //doi. 
org/10.1126/science.1251343 


Thung DT, de Ligt J, Vissers LE, 
Steehouwer M, Kroon M, de Vries P, Slag- 
boom EP, Ye K, Veltman JA, Hehir-Kwa JY 
(2014) Mobster: accurate detection of mobile 
element insertions in next generation 
sequencing data. Genome Biol 15(10):488. 
https://doi.org/10.1186/s13059-014- 
0488-x 

Keane TM, Wong K, Adams DJ (2013) Ret- 
roSeq: transposable element discovery from 
next-generation sequencing data. Bioinfor- 
matics 29(3):389-390. https://doi.org/10. 
1093 /bioinformatics /bts697 

Tang Z, Steranka JP, Ma S, Grivainis M, 
Rodic N, Huang CR, Shih IM, Wang TL, 
Boeke JD, Fenyo D, Burns KH (2017) 
Human transposon insertion profiling: analy- 
sis, visualization and identification of somatic 
LINE-1 insertions in ovarian cancer. Proc 
Natl Acad Sci U S A 114(5):E733-E740. 
https: //doi.org/10.1073/pnas. 
1619797114 

Chen Y, Ye W, Zhang Y, Xu Y (2015) High 
speed BLASTN: an accelerated MegaBLAST 
search tool. Nucleic Acids Res 43 
(16):7762-7768. https: //doi.org/10.1093/ 
nar/gkv784 

Kent WJ (2002) BLAT--the BLAST-like 
alignment tool. Genome Bes 12 
(4):656-664. https://doi.org/10.1101/gr. 
229202. Article published online before 
March 2002 

Kielbasa SM, Wan R, Sato K, Horton P, Frith 
MC (2011) Adaptive seeds tame genomic 
sequence comparison. Genome Res 21 
(3):487-493. https://doi.org/10.1101/gr. 
113985.110 

Casper J, Zweig AS, Villarreal C, Tyner C, 
Speir ML, Rosenbloom KR, Raney BJ, Lee 


119. 


120 


121. 


122. 


123. 


124. 


125. 


126. 


127. 


128. 


205 


CM, Lee BT, Karolchik D, Hinrichs AS, 
Haeussler M, Guruvadoo L, Navarro 
Gonzalez J, Gibson D et al (2018) The 
UCSC Genome Browser database: 2018 
update. Nucleic Acids Res 46(D1): 
D762-D769. https: //doi.org/10.1093/ 
nar/gkx1020 


Noll A, Grundmann N, Churakov G, 
Brosius J, Makalowski W, Schmitz J (2015) 
GPAC-genome presence/absence compiler: a 
web application to comparatively visualize 
multiple genome-level changes. Mol Biol 
Evol 32(1):275-286. https://doi.org/10. 
1093 /molbev/msu276 


. Abrusan G, Grundmann N, DeMester L, 


Makalowski W (2009) TEclass-a tool for 
automated classification of unknown eukary- 
otic transposable elements. Bioinformatics 25 
(10):1329-1330. https://doi.org/10.1093/ 
bioinformatics /btp084 

Jurka J, Klonowski P, Dagman V, Pelton P 
(1996) Censor - a program for identification 
and elimination of repetitive elements from 
DNA sequences. Comput Chem 20 
(1):119-121 

Bao W, Kojima KK, Kohany O (2015) 
Repbase Update, a database of repetitive ele- 
ments in eukaryotic genomes. Mob DNA 
6:11. https: //doi.org/10.1186/s13100- 
015-0041-9 


Wicker T, Matthews DE, Keller B (2002) 
TREP: a database for Triticeae repetitive ele- 
ments. Trends Plant Sci 7(12):561-562. [pii] 
$1360-1385(02)02372-5 


McCarthy EM, McDonald JF (2003) 
LTR_STRUC: a novel search and identifica- 
tion program for LTR retrotransposons. Bio- 
informatics 19(3):362-367. https://doi.org/ 
10.1093 /Bioinformatics/Btf878 


Kalyanaraman A, Aluru S (2006) Efficient 
algorithms and software for detection of full- 
length LTR retrotransposons. J Bioinforma 
Comput Biol 4(2):197-216. 
$021972000600203X [pii] 

Rho M, Choi JH, Kim S, Lynch M, Tang H 
(2007) De novo identification of LTR retro- 
transposons in eukaryotic genomes. BMC 
Genomics 8:90. https://doi.org/10.1186/ 
1471-2164-8-90. 1471-2164-8-90 [pii] 

Xu Z, Wang H (2007) LTR_FINDER: an 
efficient tool for the prediction of full-length 
LTR retrotransposons. Nucleic Acids Res 35 
(Web Server issue):W265—-W268. https:// 
doi.org/10.1093/nar/gkm286.  gkm286 
[pii] 

Ellinghaus D, Kurtz S, Willhoeft U (2008) 
LTRharvest, an efficient and flexible software 


206 


129. 


130. 


131. 


132. 


133. 


134. 


135. 


136. 


137. 


Wojciech Makatowski et al. 


for de novo detection of LTR retrotranspo- 
sons. BMC Bioinform 9:18. https://doi.org/ 
10.1186/1471-2105-9-18. 1471-2105-9- 
18 [pii] 

Lerat E (2010) Identifying repeats and trans- 
posable elements in sequenced genomes: how 
to find your way through the dense forest of 
programs. Heredity 104(6):520-533. 
https: //doi.org/10.1038 /hdy.2009.165. 
hdy2009165 [pii] 

Tu Z (2001) Eight novel families of miniature 
inverted repeat transposable elements in the 
African malaria mosquito, Anopheles gam- 
biae. Proc Natl Acad Sci U S A 98 
(4):1699-1704. https://doi.org/10.1073/ 
pnas.041593198. 041593198 [pii] 

Chen Y, Zhou F, Li G, Xu Y (2009) MUST: a 
system for identification of miniature 
inverted-repeat transposable elements and 
applications to Anabaena variabilis and Halo- 
quadratum walsbyi. Gene 436(1-2):1-7. 
https: //doi.org/10.1016/j.gene.2009.01. 
019. S0378-1119(09)00051-1 [pii] 

Du C, Caronna J, He L, Dooner HK (2008) 
Computational prediction and molecular 
confirmation of Helitron transposons in the 
maize genome. BMC Genomics 9:51. 
https: //doi.org/10.1186/1471-2164-9-51. 
1471-2164-9-51 [pii] 

Yang L, Bennetzen JL (2009) Structure- 
based discovery and description of plant and 
animal Helitrons. Proc Natl Acad Sci U S A 
106(31):12832-12837. https://doi.org/10. 
1073/pnas.0905563106. 0905563106 [pii] 


Feschotte C, Keswani U, Ranganathan N, 
Guibotsy ML, Levine D (2009) Exploring 
repetitive DNA landscapes using REPCLASS, 
a tool that automates the classification of 
transposable elements in eukaryotic genomes. 
Genome Biol Evol 1:205-220. https://doi. 
org/10.1093/Gbe/Evp023 


Lowe TM, Eddy SR (1997) tRNAscan-SE: a 
program for improved detection of transfer 
RNA genes in genomic sequence. Nucleic 
Acids Res 25(5):955-964 


Flutre T, Duprat E, Feuillet C, Quesneville H 
(2011) Considering transposable element 
diversification in de novo annotation 
approaches. PLoS One  6(1):e16526. 
https: //doi.org/10.1371/journal.pone. 
0016526 


Grabherr MG, Haas BJ, Yassour M, Levin JZ, 
Thompson DA, Amit I, Adiconis X, Fan L, 
Raychowdhury R, Zeng QD, Chen ZH, 
Mauceli E, Hacohen N, Gnirke A, Rhind N 
et al (2011) Full-length transcriptome assem- 
bly from RNA-Seq data without a reference 


138. 


139. 


140. 


141. 


142. 


143. 


144 


145 


146 


genome. Nat Biotechnol 29(7):644-U130. 
https: //doi.org/10.1038 /nbt.1883 


Churakov G, Grundmann N, Kuritzin A, 
Brosius J, Makalowski W, Schmitz J (2010) 
A novel web-based TinT application and the 
chronology of the Primate Alu retroposon 
activity. BMC Evol Biol 10:376. https://doi. 
org/10.1186/1471-2148-10-376. 1471- 
2148-10-376 [pii] 

Kriegs JO, Matzke A, Churakov G, 
Kuritzin A, Mayr G, Brosius J, Schmitz J 
(2007) Waves of genomic hitchhikers shed 
light on the evolution of gamebirds (Aves: 
Galliformes). BMC Evol Biol 7:190. https: // 
doi.org/10.1186/1471-2148-7-190. 1471- 
2148-7-190 [pii] 

Nilsson MA, Churakov G, Sommer M, Tran 
NV, Zemann A, Brosius J, Schmitz J (2010) 
Tracking marsupial evolution using archaic 
genomic retroposon insertions. PLoS Biol 8 
(7):e1000436. https: //doi.org/10.1371/ 
journal.pbio.1000436 


Kriegs JO, Zemann A, Churakov G, 
Matzke A, Ohme M, Zischler H, Brosius J, 
Kryger U, Schmitz J (2010) Retroposon 
insertions provide insights into deep lago- 
morph evolution. Mol Biol Evol 27 
(12):2678-2681. https://doi.org/10.1093/ 
molbev/msq162. msq162 [pii] 

Baker JN, Walker JA, Vanchiere JA, Phillippe 
KR, St Romain CP, Gonzalez-Quiroga P, 
Denham MW, Mierl JR, Konkel MK, Batzer 
MA (2017) Evolution of Alu subfamily struc- 
ture in the Saimiri lineage of new world 
monkeys. Genome Biol Evol 9 
(9):2365-2376. https://doi.org/10.1093/ 
gbe/evx172 

Luchetti A, Plazzi F, Mantovani B (2017) 
Evolution of two short interspersed elements 
in Callorhinchus milii (Chondrichthyes, 
Holocephali) and related elements in sharks 
and the coelacanth. Genome Biol Evol 9(6). 
https://doi.org/10.1093/gbe/evx094 


. Gotea V, Petrykowska HM, Elnitski L (2013) 


Bidirectional promoters as important drivers 
for the emergence of species-specific tran- 
scripts. PLoS One 8(2):e57323. https://doi. 
org/10.1371/journal.pone.0057323 


. Kostka D, Hubisz MJ, Siepel A, Pollard KS 


(2012) The role of GC-biased gene conver- 
sion in shaping the fastest evolving regions of 
the human genome. Mol Biol Evol 29 
(3):1047-1057. https://doi.org/10.1093/ 
molbev/msr279 


. Capra JA, Hubisz MJ, Kostka D, Pollard KS, 


Siepel A (2013) A model-based analysis of 
GC-biased gene conversion in the human 
and chimpanzee genomes. PLoS Genet 9(8): 


Transposable Elements: Classification, Identification, and Their Use... 207 


e1003684. https: //doi.org/10.1371/jour https: //doi.org/10.1016/j.ygeno.2014.04. 
nal.pgen.1003684 001 

147. Gotea V, Elnitski L (2014) Ascertaining 148. Makalowski W (2000) Genomic scrap yard: 
regions affected by GC-biased gene conver- how genomes utilize all that junk. Gene 259 
sion through weak-to-strong mutational hot- (1-2):61-67. https: //doi.org/10.1016/ 
spots. Genomics 103(5-6):349-356. S$0378-1119(00)00436-4 


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International 
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the 
source, provide a link to the Creative Commons licence and indicate if changes were made. 

The images or other third party material in this chapter are included in the chapter’s Creative Commons 
licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s 
Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the 
permitted use, you will need to obtain permission directly from the copyright holder. 


Part III


Phylogenomics and Genome Evolution 




Modern Phylogenomics: Building Phylogenetic Trees 
Using the Multispecies Coalescent Model 


Liang Liu, Christian Anderson, Dennis Pearl, and Scott V. Edwards 


Abstract 


The multispecies coalescent (MSC) model provides a compelling framework for building phylogenetic trees 
from multilocus DNA sequence data. The pure MSC is best thought of as a special case of so-called 
“multispecies network coalescent” models, in which gene flow is allowed among branches of the tree, 
whereas MSC methods assume there is no gene flow between diverging species. Early implementations of 
the MSC, such as “parsimony” or “democratic vote” approaches to combining information from multiple 
gene trees, as well as concatenation, in which DNA sequences from multiple gene trees are combined into a 
single “supergene,” were quickly shown to be inconsistent in some regions of tree space, in so far as they 
converged on the incorrect species tree as more gene trees and sequence data were accumulated. The 
anomaly zone, a region of tree space in which the most frequent gene tree is different from the species tree, 
is one such region where many so-called “coalescent” methods are inconsistent. Second-generation 
implementations of the MSC employed Bayesian or likelihood models; these are consistent in all regions 
of gene tree space, but Bayesian methods in particular are incapable of handling the large phylogenomic 
data sets currently available. Two-step methods, such as MP-EST and ASTRAL, in which gene trees are first 
estimated and then combined to estimate an overarching species tree, are currently popular in part because 
they can handle large phylogenomic data sets. These methods are consistent in the anomaly zone but can 
sometimes provide inappropriate measures of tree support or apportion error and signal in the data 
inappropriately. MP-EST in particular employs a likelihood model which can be conveniently manipulated 
to perform statistical tests of competing species trees, incorporating the likelihood of the collected gene 
trees on each species tree in a likelihood ratio test. Such tests provide a useful alternative to the multilocus 
bootstrap, which only indirectly tests the appropriateness of competing species trees. We illustrate these 
tests and implementations of the MSC with examples and suggest that MSC methods are a useful class of 
models effectively using information from multiple loci to build phylogenetic trees. 


Key words Introgression, Hybridization, Coalescent, Recombination, Neutrality, Molecular 
evolution 


1 Introduction 


The concept of a phylogeny or “species tree,” a bifurcating
dendrogram graphically depicting the relationships among a group
of species, is one of the oldest and most powerful icons in all of
biology. After Charles Darwin sketched the first species tree
(in Transmutation of Species, Notebook B, 1837), he remained
fascinated by the image for 22 years, eventually including a species
tree as the only figure in On the Origin of Species [1]. Though
species trees reached their aesthetic apogee with Ernst Haeckel’s
Tree of Life in 1886, the pursuit of ever-more scientifically accurate
trees has kept phylogenetics a vibrant discipline for the 150 years
since.

Because the direct evolution of species in the past is not
observable (not even in the fossil record), relationships among species are
often inferred by shared characteristics among extant taxa. Until the
1970s, this effort took place almost exclusively by using
morphological characters. Although this approach had many successes, the
paucity of characters and the challenges of comparing species with
no obvious morphological homologies were persistent problems
[2, 3]. When molecular techniques were developed in the late
1960s, it soon became clear that the sheer volume of molecular
data that could be collected would represent a vast improvement.
When DNA sequences became widely available for a range of
species [4], molecular comparisons quickly became de rigueur
[5-8]. Nonetheless, it was recognized early on that molecular
phylogenies had their own suite of problems; the concept that not
all gene tree topologies would match the true species tree topology
(i.e., would not be speciodendric sensu Rosenberg [9]) was implicit
in early empirical allozyme and mitochondrial DNA studies
[10, 11]. However, it was generally assumed that the idiosyncratic
genealogical history of any one gene, as reconstructed from extant
mutations, was an acceptable approximation for the true history of
the species given the potentially overwhelming quantity and
seductive utility of molecular data [12-15]. Indeed, this assumption is
still prevalent in the thinking of those who favor concatenation or
supermatrix approaches as a means of combining information from
multiple genes that may still differ in their genealogy from each
other and from the species tree [16, 17]. In the meantime, the term
“phylogeny” frequently became conflated with “gene tree,” the
entity produced by many of the leading phylogenetic packages of
the day. The term “species tree,” in use since the late 1970s to
emphasize the distinction between lineage histories and gene
histories (reviewed in [11, 18]), was only gradually acknowledged,
despite the fact that species trees are the rightful heirs to the term
“phylogeny” and better encapsulate the true goals of molecular and
morphological systematics [19].


1.1 Stopgap Approaches to Gene Tree Heterogeneity


By and large, the ensuing decades of molecular phylogenetics have
fulfilled much of its potential, revolutionizing taxonomies and 
resolving conundrums previously considered intractable. However, 
as the amount of genetic data per species becomes ever-more 
voluminous, it has become clear that the conflicts between individ- 
ual genes with each other and with the overarching species tree, 
both in topology and branch lengths, can have practical
consequences for phylogenetic analysis if not dealt with properly
[18-23]. At first, some researchers treated this phenomenon as 
though it were an information problem: when working with only 
a few mutations, you were bound to occasionally get unlucky and 
sequence a gene whose random signal of evolution did not match 
that of the taxa being studied. The reasoning was that more
and/or longer sequences would surely fix that problem and cause gene
trees to converge [16]. However, as more genes were sequenced,
and as the properties of gene lineages within populations were 
studied in detail [24, 25], the twin realities of gene tree heteroge- 
neity and “incomplete lineage sorting” [11] (ILS) became clear 
(Figs. 1 and 2). The probability of an event such as incomplete 
lineage sorting, which if considered alone would lead to inferring 
the wrong species tree, was worked out theoretically for the
four-allele/two-species case first [26], followed by the
three-allele/three-species case [7, 13] and more general cases [12, 27]. Pamilo and
Nei [12] were among those that proposed that the solution was to 
simply acquire more gene sequences, after which the central ten- 
dency of this gene set would point to the correct relationships, a 
“democratic vote” method, where each gene was allowed to pro- 
pose its own tree, and the topology with the most “votes” was 
declared the winner and therefore the true phylogeny. Though this
generally holds for the three-species case, the method can sometimes
produce the wrong topology with four or more species [28]. In fact, we now
know that with four or more species, there is an “anomaly zone” for 
species trees with short branch lengths as measured in coalescence 
units, in which the addition of more genes for sampled taxa is 
guaranteed to lead to the wrong species tree topology for the 
democratic vote method [29, 30]. (Coalescent time units,
equivalent to t/Ne, where t is the number of generations since divergence
and Ne is the effective population size of the lineage, are a
convenient unit for discussions of gene tree/species tree heterogeneity.
For a clear explanation, see Box 2 of Degnan and Rosenberg [28].)
Such anomaly zones may be rare empirically [31], but empirical 
examples are emerging [32, 33], and even the theoretical possibility 
remains disconcerting. In addition, because the number of possible 
tree topologies increases as the double factorial of the number of 
tips, for species trees with more than four tips, a very large number 
of genes are required to determine which gene tree is in fact the 
most frequent. Advanced consensus methods [34] can circumvent 
some of the problems of the democratic vote by using novel assem- 
bly methods, such as rooted triple consensus [35], greedy consen- 
sus [36], or supertree methods [37]. However, although such 
methods suffer from lack of a biological model motivating the 
method of consensus, approaches such as that proposed by Steel 
and Rodrigo [38] might help approximate the dynamics of 
biological models while allowing for faster and more flexible exten- 
sions and should be further developed. 
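
Two quantities from the discussion above are easy to compute directly. The sketch below is illustrative only. The function names are hypothetical, and the concordance formula is the standard three-species result under the neutral coalescent with one sampled allele per species, which is stated here as an assumption and is not derived in the text. It shows how quickly the number of rooted bifurcating topologies, (2n - 3)!!, grows with the number of tips n, and how the probability that a gene tree matches a three-species tree, 1 - (2/3)exp(-T), depends on the internal branch length T measured in coalescent units.

```python
import math


def num_rooted_topologies(n_tips: int) -> int:
    """Number of rooted, bifurcating, labeled topologies: (2n - 3)!! for n >= 2."""
    if n_tips < 2:
        raise ValueError("need at least two tips")
    count = 1
    # Product of the odd numbers 3, 5, ..., 2n - 3 (empty product for n = 2).
    for k in range(3, 2 * n_tips - 2, 2):
        count *= k
    return count


def prob_concordant_three_species(T: float) -> float:
    """Probability that a gene tree matches the species tree ((A,B),C).

    T is the length of the internal branch in coalescent units.
    Assumes the neutral coalescent with one sampled allele per species.
    """
    if T < 0:
        raise ValueError("branch length must be non-negative")
    return 1.0 - (2.0 / 3.0) * math.exp(-T)


if __name__ == "__main__":
    for n in (3, 4, 5, 10, 20):
        print(n, "tips:", num_rooted_topologies(n), "rooted topologies")
    for T in (0.0, 0.1, 0.5, 1.0, 3.0):
        print("T =", T, "P(concordant) =", round(prob_concordant_three_species(T), 4))
```

With three species the matching topology is always the most probable one for any T > 0, which is why the democratic vote only becomes misleading with four or more species.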
Fig. 1 An example showing the utility of multiple gene trees in producing species tree topologies. (a) Nine
unlinked loci are simulated (or inferred without error) from a species group with substantial amounts of
incomplete lineage sorting. Note that no single gene recovers the correct relationship between clades.
Furthermore, despite identical conditions for all nine simulations, no two genes agree on the correct topology,
let alone the correct divergence times. (b) Superimposing the nine gene trees on top of each other clarifies the
relationships. It can be (correctly) inferred that the true tree is perfectly ordered, with (ABC) diverging from D
about 1500 generations ago, the (AB)-C split occurring at 800, and A diverging from B about 600 generations
ago. Also, the amount of crossbreeding within the recently diverged taxa implies (correctly) that C has the
smallest effective population size


Fig. 2 The relationship between gene trees and species trees. Lines within the
species trees indicate gene lineages. Simplified gene trees are shown below
each species tree. Whereas gene trees on the left vary due to deep coalescence,
gene trees on the right are topologically concordant but vary slightly in branch
lengths due to the coalescent. Modified with permission from [19]


The second empirical approach to the problem of conflicting
gene trees was to bypass it altogether. Concatenation methods
appended the sequence of one gene to that of the next, to create
long alignments or supermatrices [39], a technique that in some
situations was superior to standard consensus methods in resolving
discordance or achieving statistical consistency [40]. But some
researchers, including those who questioned the “total-evidence”
approach to systematics (e.g., [41]), advocated against
concatenation when, for whatever reason, gene trees appeared to conflict with
one another. One problem with the concatenation approach was 
that it assumed full linkage across the supermatrix, a situation that 
would obviously not be the case if genes were on different chromo- 
somes. Even when the lineage lengths in a species tree are long in 
coalescent units, such that gene tree topologies are congruent, the 
branch lengths of trees of genes on different chromosomes will 
differ subtly from one another due to the stochasticity of the 
coalescent process. The early implementations of this method also 
assumed the same distribution of mutation rates across the 
sequence, which was clearly not the case if the matrix included 
coding and noncoding regions. Like democratic vote methods, 
concatenation of many genes was sometimes defended as sufficient 
to override the conflicting signal across genes [42, 43], despite 
widespread acknowledgment that gene tree heterogeneity is ubiq- 
uitous and that concatenation can sometimes give the wrong 
answer, especially although not exclusively in the anomaly zone 
[44, 45]. 
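
The data manipulation behind concatenation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of any particular phylogenetic package: the function name, the dictionary-based alignment format, and the use of "?" for taxa missing at a locus are all hypothetical choices. Locus boundaries are returned alongside the supermatrix because a tree search run on the supermatrix alone treats every column as part of one fully linked locus, which is the assumption criticized above.

```python
def concatenate_alignments(loci):
    """Join per-locus alignments end to end into one supermatrix.

    loci: list of dicts mapping taxon name -> aligned sequence; all
    sequences within one locus must have the same length.
    Returns (supermatrix, partitions), where partitions holds the
    half-open (start, end) column range of each locus.
    """
    taxa = sorted(set().union(*(locus.keys() for locus in loci)))
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 0
    for locus in loci:
        lengths = {len(seq) for seq in locus.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a locus must have equal length")
        length = lengths.pop()
        for taxon in taxa:
            # Assumption: a taxon absent from a locus is padded with '?'.
            pieces[taxon].append(locus.get(taxon, "?" * length))
        partitions.append((start, start + length))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


if __name__ == "__main__":
    example = [
        {"A": "ACGT", "B": "ACGA", "C": "ACGA"},
        {"A": "GG", "C": "GT"},
    ]
    matrix, parts = concatenate_alignments(example)
    for taxon, seq in matrix.items():
        print(taxon, seq)
    print(parts)
```

Keeping the partition list makes it possible to assign separate substitution models to each locus, but it does not by itself give each locus its own genealogy.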

Concatenation as a method of combining phylogenomic data 
still remains popular by default [16, 46], particularly among phylo- 
genetic studies of higher taxa where incomplete lineage sorting is 
assumed to be rare. However, this logic suffers from two flaws 
frequently seen in the literature. First, “deep” phylogenetic studies 
among higher taxa are no more immune to the problems of ILS 
than are studies among closely related species, because it is the
length of a given branch, not its depth in the tree, that is relevant
to the probability of gene tree discordance [28]. Detecting such ILS
and ruling out gene tree congruence will indeed be more challeng- 
ing in deep phylogenomic studies, but it should not be assumed 
that ILS will be less prevalent at deep scales than at shallow scales. 
Second, current implementations of concatenation represent only 
one way of species tree construction in which each gene is forced to 
have the same topology. The real distinction between concatena- 
tion and coalescent models is not the presence or absence of ILS 
but rather the possibility of conditional independence of gene trees 
as mediated by recombination between genes [47]. Even if all gene 
trees in an analysis are topologically identical, physically connecting 
different genes in a single supermatrix does not capture variation in 
branch lengths that recombination will allow in nature. More effort 
should be devoted to “supermatrix-like” methods that constrain 
gene trees to the same topology but allow recombination between 
genes and conditional independence of branch lengths, since these 
qualities will influence how signal is accumulated as more genes are 
added [47]. A final problem with concatenation is that, in a strict 
sense, concatenation also does not generate species trees, in so far as 
the method treats all nucleotides as if they were part of a single 
non-recombining gene, and thus does not distinguish between 
gene and species trees [19]. In the end, concatenation is best 
thought of as a special case of more general models of phylogenetic 
inference that acknowledge gene tree heterogeneity and condi- 
tional independence of genes. One such model is the multispecies 
coalescent model [23, 28, 48]. It is this model that provides the 
basis for a recent flurry of promising methods that permit efficient 
and consistent estimation of species trees under a variety of 
conditions. 
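
The point made above, that branch length rather than depth governs discordance, can be illustrated with the standard three-taxon result from neutral coalescent theory. For a rooted species tree ((A,B),C) whose internal branch has length T in coalescent units, the gene tree matches the species tree with probability 1 - (2/3)exp(-T), and each of the two discordant topologies has probability (1/3)exp(-T). The sketch below is only a minimal illustration; the function name and the example branch lengths are assumptions made here and do not come from any package discussed in this chapter.

```python
import math


def three_taxon_gene_tree_probs(internal_branch):
    """Gene tree topology probabilities for the species tree ((A,B),C).

    internal_branch is the length, in coalescent units, of the branch
    ancestral to A and B. The calculation assumes the neutral coalescent
    with no gene flow and no recombination within the locus.
    """
    if internal_branch < 0:
        raise ValueError("branch length must be non-negative")
    # Probability that the A and B lineages fail to coalesce in the
    # internal branch and enter the root population together with C.
    p_no_coalescence = math.exp(-internal_branch)
    discordant = p_no_coalescence / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


# Illustrative (hypothetical) internal branch lengths in coalescent units.
for T in (0.1, 0.5, 1.0, 3.0):
    probs = three_taxon_gene_tree_probs(T)
    print(T, round(probs["((A,B),C)"], 4))
```

Note that the age of the node does not appear anywhere in the calculation: a short internal branch yields the same expected discordance whether it lies near the tips or deep in the tree.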


2 The Multispecies Coalescent Model 


A plausible probabilistic model for analyzing multilocus sequences 
should involve not only the phylogenetic relationship of species 
(species tree) but also the genealogical history of each gene (gene 
tree) and allow different genes to have different histories. Unlike 
concatenation, such a multispecies coalescent model (MSC) 
explains the evolutionary history of multilocus sequences through 
two levels of biological hierarchy, the gene tree and the species tree, 
rather than just one [23, 49]. Models acknowledging these two 
levels require an explicit description of how sequences evolve on 
gene trees, the traditional likelihood equation of Felsenstein [50] 
and others, as well as how gene trees evolve in the species tree, the 
likelihood for which was first described by Rannala and Yang 
[48]. With a few exceptions (described below), the genealogical 
relationship (gene tree) of neutral alleles can be simply depicted by a 
coalescence process in which lineages randomly coalesce with each 
other backward in time. The MSC is a simple application of the 
single population coalescent model to each branch in a species tree 
[28]. It holds the standard assumptions found in many neutral 
coalescent models: no natural selection or gene flow among popu- 
lations, no recombination within loci but free recombination 
between loci, random mating and a Wright-Fisher model of inheri- 
tance down each branch of the species tree. Despite these seemingly 
oversimplified assumptions, the pure coalescent model is funda- 
mental in explaining the gene tree-species tree relationship because 
it forms a baseline for incorporating additional evolutionary forces 
on top of random drift [28, 49]. More importantly, the pure 
coalescent model provides an analytic tool to detect the evolution- 
ary forces responsible for the deviation of the observed data 
(molecular sequences) from those expected from the model. 

The coalescent process works, in effect, by randomly choosing 
ancestors with replacement from the population backward through 
time for each sequence in the original sample. Eventually, two of 
these lineages will share a common ancestor, and the lineages are 
said to “coalesce.” The process continues until all lineages coalesce 
at the most recent common ancestor (MRCA). Multispecies coa- 
lescence works the same way but places constraints on how recently 
the coalescences occur, corresponding to the species’ divergence 
times. Translating this model into computer algorithms for infer- 
ring species trees has led to a plethora of models [51-55], some of 
which first build gene trees by traditional methods and then com- 
bine them into a species tree with the highest likelihood or other 
criteria (“two-step” methods, e.g., [56] or [57]), others of which, 
particularly Bayesian methods [58-60], simultaneously estimate 
gene trees and species tree. In general, for likelihood or Bayesian 
approaches, a species tree is proposed, and the probability of 
each gene tree is evaluated under the MSC, with or without various 
priors, to obtain the likelihood of the data (DNA sequences in the 
case of Bayesian methods or gene trees in the case of likelihood 
methods like MP-EST [56]) given the species tree, or the posterior 
probability of the species tree. In this way, traditional multispecies 
coalescent methods are the converse of consensus methods; rather 
than each locus proposing a potentially divergent species tree, a 
common species tree is assumed and evaluated, given the some- 
times divergent patterns observed among multiple loci. 
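
The backward-in-time process described above can be sketched in a few lines. The first function below literally chooses ancestors with replacement in a haploid Wright-Fisher population until all sampled lineages share a common ancestor; the second shows the constraint added by the multispecies setting for the simplest case of one lineage from each of two species. Function names, population sizes, and seeds are illustrative assumptions, and the discrete-generation scheme is a teaching device rather than the algorithm used by any of the packages cited here.

```python
import random


def simulate_coalescent(n_samples, pop_size, seed=None):
    """Trace sampled lineages backward in a haploid Wright-Fisher population.

    Each generation, every remaining lineage picks a parent uniformly at
    random, with replacement, from pop_size individuals; lineages that pick
    the same parent coalesce. Returns a list of (generation, lineages_left)
    tuples, one per generation in which at least one coalescence occurred.
    """
    rng = random.Random(seed)
    lineages = n_samples
    generation = 0
    events = []
    while lineages > 1:
        generation += 1
        parents = {rng.randrange(pop_size) for _ in range(lineages)}
        if len(parents) < lineages:
            lineages = len(parents)
            events.append((generation, lineages))
    return events


def two_species_coalescence_time(divergence_gen, anc_pop_size, seed=None):
    """Coalescence time for one lineage sampled from each of two species.

    The two lineages cannot share an ancestor more recently than the
    species divergence, so the search begins in the ancestral population.
    """
    rng = random.Random(seed)
    generation = divergence_gen
    while True:
        generation += 1
        if rng.randrange(anc_pop_size) == rng.randrange(anc_pop_size):
            return generation


# Hypothetical parameter values, chosen only for illustration.
print(simulate_coalescent(n_samples=6, pop_size=100, seed=1))
print(two_species_coalescence_time(divergence_gen=500, anc_pop_size=100, seed=1))
```

The last entry returned by `simulate_coalescent` marks the generation of the MRCA, and the value returned by `two_species_coalescence_time` always exceeds the divergence time, which is the constraint the MSC places on gene trees.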

A number of implementations of this idea have been developed 
(reviewed by Edwards [19, 54]). Several “two-step” packages are 
available for moving from independently built gene trees to species 
trees, including minimization of deep coalescence [61], STEM 
[62], JIST [63], GLASS [64], STAR, STEAC [65], NJst [66], 
and ASTRAL [57, 67]. Three methods to date utilize “one-step” 
Bayesian methods to infer gene trees and the species tree, with the 
input data being DNA sequences: BEST [58, 68, 69], *BEAST2 
[59], and a new model (A00) in the Bayesian Phylogenetics and 
Phylogeography (bpp) package [70-72]. An additional “one-step” 
method, SVDquartets [73], derives species trees directly from 
aligned, unlinked single-nucleotide polymorphisms using the 
method of invariants in a coalescent framework. Species tree 
methods exhibit a number of attractive advantages over concatenation 
methods in terms of performance. These advantages are not 
restricted to the anomaly zone, occur across broad regions of tree 
space, and include less susceptibility to long-branch attraction [74] 
and missing data [75]. Another attractive aspect of species tree 
methods and multispecies coalescent models is that they deliver 
more appropriate estimated levels of confidence that are more 
evenly spread across genes and appear to be less susceptible to the 
inflation of posterior probabilities that was early on attributed to 
Bayesian analyses (e.g., [76, 77]) but may also be due to model 
misspecification due to concatenation [53]. Bayesian methods are 
generally agreed to be the most efficient and accurate, capturing all 
details of the MSC model seamlessly [52]. However, one drawback 
is that the estimation of larger numbers of parameters (population 
sizes and divergence times in addition to topologies) can slow 
computation, may not be relevant in some situations [78], and is 
generally not possible with the large data sets that are routinely seen 
today in phylogenomics [59]. Thus far, two-step methods such as 
ASTRAL, STAR, NJst, and MP-EST have proven the most widely 
used for large-scale phylogenomic studies, such as the Avian 
Phylogenomics Project [79] and large-scale phylogenomics of fish [80] 
and plants [81]. 


2.1 Violations of the Multispecies Coalescent Model 


2.1.1 Population Processes 


The “standard” and most common reason why gene trees are not 
speciodendritic is incomplete lineage sorting, i.e., lineages have not 
yet been reproductively isolated for long enough for drift to cause 
complete genetic divergence in the form of reciprocal monophyly 
of gene trees ([82]; Figs. 1 and 2). This source of gene tree 
heterogeneity is guaranteed to be ubiquitous, if only because it 
arises from the finite size populations of all species that have ever 
come into existence. Almost all the techniques and software 
packages discussed above are designed to approximate uncertainties 
in species tree topology arising from this phenomenon. 


Delimitation of Species and Diverging Lineages 


For recent divergences, the definition of “species” can become 
problematic for species tree methods [63], and the challenge of 
delimiting species has, if anything, increased now that the overly 
conservative strictures of gene tree monophyly as a delimiter of 
species have been mostly abandoned [82]. This fundamental issue 
in a phylogenetic study—whether the extent of divergence among 
lineages warrants species status—has not gone away in the genomic 
era. However, traditional species tree methods using the MSC need 
not use “good” species as OTUs; they will work perfectly well on 
lineages that have recently diverged, so long as they have ceased 
exchanging genes. The key issue is not whether the OTUs in 
species tree analyses are in fact species but rather whether they 
have ceased exchanging genes, because ongoing gene flow has been 
shown to compromise traditional MSC methods [83, 84] (see below). 

The problem of species delimitation may ultimately be solved 
by data other than genetics, and today few species concepts use 
strictly genetic criteria [85]. Some have suggested that the line 
between a population-level difference and a species-level difference 
can be drawn empirically and with consistency in well-studied taxa 
such as birds, using morphological, environmental, and behavioral 
data simultaneously [86]. Thus, there is some hope that species 
delimitation can be performed rigorously a priori in many cases. 
Researchers who opt for delimiting species primarily with molecu- 
lar data have a wide array of techniques and prior examples available 
to them, although not all without controversy [71, 87-93]. Recent 
progress in species delimitation is motivated by the conceptual 
transition from “biological/reproductive isolation species” to the 
“lineage species concept,” which defines species not in terms of 
monophyly of gene lineages but as population lineage segments in 
the species tree [93]. Under that expanded concept, delimiting the 
boundaries of species (i.e., lineages in the species tree) can be facilitated by 
collection and analysis of gene trees in the framework of the 
multispecies coalescent model [72]. The recent suggestion that 
coalescent species delimitation methods define only structure but 
not species [90] was, in our view, already well-established, with 
confusion stemming largely from the term “species delimitation,” 
as opposed to “delimitation of populations between which gene 
flow has ceased.” 


There are a number of other situations in which the assumptions 
of the coalescent are violated. MSC models involve a series of 
isolation events unaccompanied by gene flow. In this regard, 
they are like the isolation-migration models of phylogeography 
[94, 95] but without the migration. The assumption of no gene 
flow naturally restricts their utility, but gene flow of course com- 
promises other methods of phylogenetic inference, including con- 
catenation methods, as well. Additionally, situations in which gene 
flow yields a prominent molecular signal often are detectable 
primarily among very closely related species in the realm of phy- 
logeography [96]. If some substantial gene flow continues 
between species after divergence, then the multispecies coalescent 
can quickly destabilize, especially for a small number of loci and as 
the rate of genetic introgression increases (Fig. 6 in [87, 97-99]). 
We recommend model comparison algorithms like PHRAPL [87] 
for determining whether a given data set conforms to the assumptions 
of the MSC. 


2.1.2 Molecular Processes 


In addition to species delimitation and gene flow, there are at least 
three mechanisms that generate discordance on the molecular level 
(Fig. 3). These include horizontal gene transfer (HGT), which can 
pose a serious risk to phylogenetic analysis; gene duplication, whose 
risks can be avoided by certain models; and natural selection, which 
generally poses no direct threat but, depending on its mode of 
action and consequences for DNA and protein sequences, can be 
the most challenging of all. 


[Figure 3: panels “Horizontal Gene Transfer,” “Gene Duplication,” and “Convergent Evolution,” each contrasting the true history with the inferred history for taxa A, B, C, and D; labels include “Mutation”] 

Fig. 3 Three examples of gene histories that depart from the standard 
multispecies coalescent model. (a) A duplication event that precedes a 
speciation event can lead to incorrect inference of divergence times in the 
species tree if copy 1 is compared to copy 2. This can be particularly difficult 
if one of the gene copies has been lost or not sequenced by the researcher. (b) 
Convergent evolution can occur at the molecular level, for example, in certain 
genes under strong natural selection or highly biased mutational processes. 
These processes will tend to bring together distantly related taxa in the 
phylogenetic tree and are likely to be given additional false support by 
morphological data. (c) Horizontal gene transfer causes difficulties in some 
current species tree methods, because it establishes a spurious lower bound 
to divergence times. Though rare in eukaryotes, it is by no means unknown and 
is likely to become a more difficult problem in the future when species trees are 
based on tens of thousands of loci 


HGT is now known to be so widespread across the Tree of Life, 
especially in prokaryotes, that some have suggested a web of life 
may be a more appropriate paradigm for phylogenetic change 
[100-102]. Growing evidence shows that even eukaryotic gen- 
omes contain substantial amounts of “uploaded” genetic material 
from bacteria, archaea, viruses, and even fellow eukaryotes 
[103-105]. Even though effective techniques are not yet widely 
available for detecting HGT in eukaryotes, enough individual cases 
have been “accidentally” discovered that researchers have given up 
trying to list them all [103]. 

The implications of HGT for species tree construction vary 
depending on the method used. For example, following the stan- 
dard assumption in coalescent theory that allelic divergences must 
occur earlier in time than the divergences of species harboring those 
alleles, some species tree techniques [48, 58], as well as classical 
approaches (e.g., [13]), assume that the gene tree exhibiting the 
most recent divergence between taxon A and taxon B establishes a 
hard upper limit on the divergence time of those species in the 
species tree. For small sets of genes in taxa where HGT is rare, a 
researcher would need to be quite unlucky to choose a horizontally 
transferred gene for analysis. However, as the genomic era 
advances, it becomes more likely that at least one of the thousands 
of genes studied will have been transferred horizontally and thus 
establish a spurious upper bound for clade divergence at the species 
level. When selective introgression of genes from one species to 
another is considered, this number of genes coalescing recently 
between species will increase [106]. Although HGT is clearly a 
problem for some current methodologies, if transferred genes can 
first be identified, then they could be extremely useful as genomic 
markers for monophyletic groups that have inherited such genes 
and would otherwise be difficult to resolve [107]. However, for 
other species tree methods that calculate averages of coalescence 
times, such as STAR [65], HGT events will have less of an impact. 
Liu et al. [56] examined the effect of HGT on the pseudo- 
likelihood method MP-EST and predicted that, mathematically, 
species tree branch lengths may be biased by HGT but that topol- 
ogies were fairly robust. Davidson et al. [108] found that quartet- 
based methods, such as ASTRAL-II, were fairly robust to HGT in 
the presence of ILS. Removal of genes suspected to be transferred 
via HGT prior to species tree analysis would be warranted; how- 
ever, some methods to detect such events rely both on having the 
true species tree already in hand and also on the absence of other 
mechanisms causing gene tree discordance [109-112]. Recent 
work aims to incorporate HGT into other mechanisms of gene 
tree incongruence (reviewed in [113]); how much we need to 
invest in such synthetic methods will likely depend on the 
prevalence of HGT in particular taxonomic groups. 


Gene Duplication 


Gene duplication presents another violation of the basic MSC 
model (Fig. 3); like HGT, its potential problems are worst when 
they go unrecognized [49]. Imagine a taxon where a gene of 
interest duplicated 10 Mya into copy α and copy β; the taxon then 
split 5 Mya into species 1 and 2. A researcher investigating the 
daughter species would therefore sequence four orthologous 
genes, with the potential to compare α1 to β2 and β1 to α2 and 
thus generate two gene trees where the estimated split time was 
10 Mya, rather than 5 Mya. Such a situation will be easily 
recognized if copy α and β have diverged sufficiently by the time of their 
duplication, and a number of methods of coalescent analysis have 
incorporated gene duplication (e.g., [114, 115]; reviewed in 
[116]). Additionally, failure to recognize the situation may not 
have drastic consequences for phylogenetic analysis if the paralogs 
have coalesced very recently or are species-specific, in which case the 
estimated gene coalescence would be approximately correct no 
matter which comparison was made. However, if one of the copies 
has been lost and only one of the remaining copies is sequenced, 
then the chances of inferring an inappropriately long period of 
genetic isolation are larger and will increase as the size of the family 
of paralogs increases. Assessing paralogs in phylogenomic data is a 
major challenge, particularly in groups like plants and fish, and a 
growing number of dedicated methods ([117]; assessed in [118]) 
or filtering protocols [119] for doing so exist. This problem will 
tend to overestimate gene coalescence times, and some species tree 
methods depend on minimum isolation times among a large set of 
genes. These deep coalescences might spuriously increase inferred 
ancestral population sizes. A systematic search for biases incurred by 
species tree methods due to gene duplication is needed. 


Natural selection causes yet another violation of the multispecies 
coalescent model. Selection can cause serious problems in some 
cases, although in other circumstances it is predicted not to cause 
problems for phylogenetic analysis [47, 120]. The usual stabilizing 
selection can be helpful to taxonomists working at high levels 
because it slows the substitution rate; likewise selective sweeps, 
directional selection, and genetic surfing [121] tend to clarify 
phylogenetic relationships by accelerating reciprocal monophyly 
for genes in rapidly diverging clades. However, challenges to phy- 
logenetic inference are posed by any evolutionary force that may 
bias the reconstruction of gene trees, including convergent neutral 
mutations (homoplasy), balancing selection, and selection-driven 
convergent evolution (e.g., [122]). Balancing selection tends to 
preserve beneficial alleles at a gene for long periods of time and is 
probably the most insidious form of selection with respect to 
accurately reconstructing gene trees and species trees. 


2.2 More About Violations and Model Fit of the Multispecies Coalescent Model 


Many of the instances of violations of the coalescent model will 
occur at individual genes and usually will not dominate the signal 
of the entire suite of genes sampled for phylogenetic analysis. Reid 
et al. [123] conducted one of the few tests of the fit of the MSC to 
multilocus phylogenetic data. Although the title of their article 
suggests that the MSC overall provides a “poor fit” to empirical 
data, we suggest that their results provide a more hopeful picture. 
Most importantly, they investigated the fit of the 
MSC to individual loci in phylogenetic data sets and were able to 
identify loci that failed to fit the MSC. They were less successful at 
identifying the causes of departure from the MSC for individual loci. 

More common but still rare are efforts to determine which 
model of phylogenetic inference, the MSC or concatenation, 
provides a better fit to empirical phylogenomic data. Edwards et al. 
[124] and Liu and Pearl [58] both used the Bayesian species tree 
method BEST [68] and Bayes factors to ask whether the MSC or 
concatenation fits empirical data sets better. Uniformly, they found 
that the MSC fit empirical data sets better than concatenation, 
often by a large margin. However, further work in this area is still 
needed. Most discussions in the literature have focused on the 
perceived failings or violations of the MSC by empirical data 
sets (such as evidence for recombination within loci), even when 
such failings or assumptions also apply to concatenation [47]. 
Given that all models are approximations of reality, a better focus 
would be to ask which model fits empirical data sets better. 
The limited research that has been done suggests overwhelmingly 
that the MSC provides a better fit to empirical data sets than 
concatenation. 

Are there better models for phylogenomics than the MSC? 
Depending on the data set, almost surely there are (Fig. 4). Several 
authors working with phylogenomic data sets have suggested that 
gene flow is detectable, even among lineages that diverged a long 
time ago (e.g., [129, 130]). The increasing number of reports of 
hybridization and introgression among phenotypically distinct 
species suggests that hybridization may be a typical component of 
speciation and that even phylogenetic models can be improved by 
incorporating such reticulation (e.g., [47, 106, 131]). The pure 
MSC is best thought of as a special case of so-called “multispecies 
network coalescent” models, or MSNC [127, 132-134] (Fig. 4), in 
which gene flow connects some branches of the species tree. In the 
end, empiricists will need to decide what level of model fit they are 
willing to tolerate and which software packages can accommodate 
the large data sets that are now routine in phylogenomics. 


2.2.1 Phylogenetic Outlier Loci 


Genes whose phylogenetic signal differs significantly from that of 
the remainder of the data set can be thought of as phylogenetic outliers. 
These loci are conceptually similar to outliers in population 
genetics, which have been the focus of many studies (reviewed in 
[135-137]). However, there has been little work in detecting 
phylogenomic outliers. Much attention has been paid to particular 
sites in a data set that differ from the majority and therefore exhibit 
homoplasy or incongruence with the rest of the data set 
[76, 138]. The sources of such incongruence are many and can 
include mutational processes (e.g., gene duplication), HGT, as well as 
homoplasy (e.g., [139, 140]). Incongruence of particular sites, or 
entire loci, may also be due to technical issues such as 
contamination, misassembly, mistaken paralogy, annotation mistakes, and 
alignment errors (e.g., [119]). Here, in an analogy with work in 
population genetics, we will focus primarily on entire loci that 
deviate from the expected distribution governed by neutral 
processes due to natural selection. Understanding the distribution of 
gene tree topologies expected under the neutral multispecies 
coalescent [25] is a good starting point for identifying loci that may be 
targets of natural selection. 

Fig. 4 Diversity of phylogeographic models. Species trees estimated by the 
multispecies coalescent are naturally related to previous phylogeographic 
models by their shared demographic parameters, usually measured in units of 
mutation rate or substitutions per site (μ), including genetic diversity or effective 
population size (4Nμ, where N = effective population size); gene flow (M, where 
M = the scaled migration rate; 4Nm, where m is the number of migrants per 
generation); and divergence time (τ = μt, where t is the divergence time in 
generations). (a) Equilibrium migration models as envisioned by early versions of 
the software MIGRATE [125]. (b) Isolation-migration models envisioned by Hey 
and coworkers [48, 95, 126]. Subscript A indicates ancestral population size. (c) 
Species tree models estimated by the multispecies coalescent [28]. (d) 
Multispecies network coalescent models or phylogenetic network models including 
divergence and gene flow [127, 128] 


2.2.2 Genomic Signals of Phylogenetic Outliers 


When faced with a surprising or nonconvergent species tree, one 
possibility is that an unusual gene tree is to blame. Though techni- 
ques for dealing with violations of the coalescent model are in their 
infancy, researchers do have a few options. Below we list several 
ideas, some borrowed from classical phylogenetics or from meth- 
ods used in bioinformatics. It is likely that the several tests con- 
structed to detect phylogenetic outliers in classical phylogenetics 
can be extended slightly to incorporate the additional variation 
among genes expected due to the coalescent process. Of course, 
with larger data sets, at least with some coalescent methods, single 
anomalous genes may have little effect on the resulting species tree, 
particularly in species tree methods utilizing summary statistics 
[65]. However, as pointed out above, species tree methods such 
as BEST that rely on “hard” boundaries for the species tree set by 
individual genes could be derailed due to the anomalous behavior 
of even a single gene. 
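
The contrast drawn above, between methods that take a hard bound from individual genes and methods that summarize across genes, can be illustrated with a toy calculation. The sketch below is deliberately a caricature: it is not the algorithm implemented in BEST, STAR, or any other package, and the coalescence times are hypothetical values in arbitrary units. It only shows why a minimum over loci is sensitive to one anomalous locus while an average is much less so.

```python
from statistics import mean


def hard_bound_estimate(coalescence_times):
    """Upper bound on species divergence implied by the most recently
    coalescing locus (a caricature of a 'hard' per-gene boundary)."""
    return min(coalescence_times)


def average_estimate(coalescence_times):
    """Average coalescence time across loci (a caricature of a
    summary-statistic approach)."""
    return mean(coalescence_times)


# Hypothetical coalescence times between taxon A and taxon B at ten loci.
typical_loci = [5.2, 5.9, 6.4, 5.5, 7.1, 6.0, 5.7, 6.8, 5.3, 6.2]
# The same data plus one locus that coalesces anomalously recently,
# for example because it was horizontally transferred.
with_outlier = typical_loci + [0.4]

for label, times in (("typical", typical_loci), ("with outlier", with_outlier)):
    print(label, hard_bound_estimate(times), round(average_estimate(times), 2))
```

In this toy data set the single anomalous locus drags the hard bound from 5.2 down to 0.4, whereas the average shifts by roughly half a unit.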

Jackknifing. A straightforward approach to detecting phyloge- 
netic outliers under the multispecies coalescent model is to rerun 
the analysis n times, where n is the number of loci in the study, 
leaving one locus out each time. An outlier can then be identified if 
the analysis that does not include that gene differs from the remain- 
ing analyses in which that gene is included. This approach has been 
applied successfully in fruit flies by Wong et al. [21], who consid- 
ered their problem resolved when the elimination of one of the ten 
genes unambiguously resolved a polytomy. There may be other 
metrics of success that are more robust or sensitive or do not 
depend as strongly on a priori beliefs about the relationships 
among taxa. Because some duplications or horizontal transfers 
may affect only one taxon, whole-tree topology summary statistics 
are unlikely to be sensitive enough to detect recent events. How- 
ever, the cophenetic distance of each taxon to its nearest neighbor 
in the complete species tree could be compared across jackknife 
results. This procedure will produce a distribution of “typical” 
distances, and significance can therefore be assigned to highly 
divergent results. The drawback to such an approach is the 
computational demand. Species tree analyses on their own can be 
extremely time consuming to run even once, so jackknifing 
may prove intractable for studies involving many species and loci 
(see ref. 141). 


2.2.3 Simulation Approaches to Detecting Phylogenetic Outliers 


Simulating gene trees from a species tree is another method for 
identifying gene trees that differ from the majority of loci in the 
data set. Several species tree methods yield estimates of the 
phylogeny that include branch lengths in coalescent units [56, 57, 70], 
which are required to simulate gene trees from a species tree. 
Branch lengths in the estimated species tree can be decomposed 
into a number of substitutions per site and an estimate of θ = 4Nμ 
that are compatible with the original branch length in coalescent 
units. For example, using any number of algorithms, including 
maximum likelihood or Bayesian methods, the length of species 
tree branch lengths in substitutions per site can be approximated by 
fitting the concatenated alignment of genes to the estimated species 
tree topology, yielding a tree with the same topology but branch 
lengths in substitutions per site (μt, where t is the time span of the 
branch in either generations or years). With these branch lengths in 
hand, estimates of θ can then be applied to each branch so that the 
original coalescent units t/2N ~ μt/θ from the species tree are 
retained. Care needs to be taken to preserve the appropriate ploidy 
units when simulating gene trees from an estimated species tree. 
Packages such as MP-EST yield estimates of species tree branch 
lengths in coalescent units of 4N generations, appropriate for 
diploids, whereas packages such as Phybase [142] simulate gene 
trees from a species tree in estimates of 2N units, appropriate for 
haploids. Another issue that is important to be aware of is the 
distinction between gene coalescence times and species tree branch 
lengths [143, 144]. Whereas species tree branch lengths are esti- 
mates of lineage or population branch lengths in the species tree, 
the DNA sequence alignment that is fitted to the species tree will 
yield branch lengths reflecting the coalescence time of genes in 
ancestral species. This discrepancy occurs because gene coalescence 
times by necessity predate and record a more ancient event than do 
species divergence times. The discrepancy may represent a small 
fraction of the branch length if species divergence times are large, 
but Angelis and dos Reis [143] have suggested that the discrepancy 
can be quite large even in comparisons of distantly related species, 
such as exemplars of mammalian orders. There is a great need for 
methods of molecular dating and combining fossils and DNA data 
that distinguish between gene coalescence times and speciation 
times, the latter of which is usually of primary interest. 
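
The unit bookkeeping described above can be made explicit with a short sketch. It assumes the convention θ = 4Nμ and treats the choice between 4N and 2N coalescent units as a parameter, following the distinction drawn above between packages; the function and parameter names are hypothetical and are not part of MP-EST or Phybase. Given a branch length in substitutions per site (μt) and the same branch in coalescent units, the first function solves for the θ that keeps the two consistent, and the second re-expresses a coalescent-unit branch length under the other convention.

```python
def theta_from_branch(subs_per_site, coalescent_length, n_multiplier=4):
    """Return theta = 4*N*mu implied by one species tree branch.

    subs_per_site      branch length mu*t (substitutions per site)
    coalescent_length  the same branch expressed as t / (n_multiplier * N)
    n_multiplier       4 or 2, depending on the convention in use

    Because mu*t / theta = t / (4*N), the coalescent length equals
    (4 / n_multiplier) * mu*t / theta, which is solved here for theta.
    """
    if subs_per_site <= 0 or coalescent_length <= 0:
        raise ValueError("branch lengths must be positive")
    return (4.0 / n_multiplier) * subs_per_site / coalescent_length


def convert_coalescent_units(length, from_multiplier, to_multiplier):
    """Re-express a branch length t/(from_multiplier*N) as t/(to_multiplier*N)."""
    return length * from_multiplier / to_multiplier


# Hypothetical branch: 0.01 substitutions per site and 0.5 units of 4N generations.
print(theta_from_branch(0.01, 0.5, n_multiplier=4))
# The same branch expressed in units of 2N generations.
print(convert_coalescent_units(0.5, from_multiplier=4, to_multiplier=2))
```

Passing a branch length to a simulator without this conversion would halve or double every branch in coalescent units, and with it change the expected amount of gene tree discordance.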

Once the branch lengths of the species tree are prepared for 
simulation, gene trees can be simulated using a number of packages 
(Phybase, [142]; TreeSim, [145]; CoMus, [146]). Even packages 
traditionally used in phylogeography can be used to simulate gene 
trees on species trees, given the close relationship between species 
trees and phylogeographic models like isolation migration 
[147, 148]. One can then compare the distribution of gene tree 
topologies and branch lengths observed in one’s data set with those 
simulated under the neutral coalescent model. A common 
approach is to calculate the distribution of Robinson-Foulds 
[149] distances among simulated gene trees and compare these to 
those observed in the original data set. Such approaches have been 
used to determine if a data set is consistent with the MSC or the 
percent of the observed gene tree variation that is explained by the 
MSC. Other statistics, such as the similarity in number of minority 
gene tree triplets produced by a given species tree at each node, can 
also be compared to the observed distribution. Song et al. [150] 
used coalescent simulations using Phybase to propose that the MSC 
could explain a large (>75%) fraction of the observed gene tree 
variation in a mammalian data set. Such simulations assume that the 
gene tree variation observed is biological in origin and not due to 
errors in reconstruction. They also noted that the near equivalence 
in frequency of minority triplets in gene trees at various nodes in the 
mammal tree suggested broad applicability of the neutral coalescent 
without gene flow or other complicating factors. Still, many papers 
observe some level of departure of the patterns in the observed data 
set from those expected under simulation. Usually the source of 
this departure is unknown. Natural selection or any other force 
such as HGT or anomalous mutation might be culprits in these 
cases. Heled et al. [151] proposed a simulation regime that incor- 
porates gene flow between species and thus can be used to test for 
the effects of migration on gene trees and species tree estimation. 
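
The comparison of observed and simulated gene trees by tree distance, described above, can be sketched as follows. For brevity the example uses a rooted, clade-based variant of the Robinson-Foulds distance (the number of clades found in one tree but not the other) rather than the original formulation on unrooted bipartitions, and it represents trees as nested tuples instead of parsing Newick files. All names and the tiny example trees are assumptions made for illustration; in practice the simulated trees would come from one of the simulators mentioned above.

```python
from itertools import combinations


def clades(tree):
    """Return the non-trivial clades (frozensets of leaf names) of a rooted
    tree given as nested tuples, e.g. (("A", "B"), ("C", "D"))."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return found


def rf_distance(tree_a, tree_b):
    """Rooted Robinson-Foulds-style distance: size of the symmetric
    difference between the two trees' clade sets."""
    return len(clades(tree_a) ^ clades(tree_b))


def mean_pairwise_rf(trees):
    """Average distance over all pairs of trees in a collection."""
    pairs = list(combinations(trees, 2))
    return sum(rf_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical four-taxon gene trees.
t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
t3 = ((("A", "B"), "C"), "D")
print(rf_distance(t1, t2), rf_distance(t1, t3))
print(mean_pairwise_rf([t1, t1, t3]))
```

A summary such as `mean_pairwise_rf` computed on the observed gene trees can then be placed against the distribution of the same summary across replicate sets of simulated gene trees.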

To detect possible phylogenetic outliers, Edwards et al. [152] 
applied a recently proposed method of detecting gene tree outliers, 
KDEtrees [153], to a series of phylogenomic data sets. KDEtrees 
uses the kernel density distribution of gene tree distances to esti- 
mate the 95% confidence limits on gene tree topologies in a given 
data set. Surprisingly, using default parameters, Edwards et al. 
[152] could not detect a higher-than-expected number of gene 
tree outliers in any data set, despite the fact that the data sets in 
several cases contained hundreds of loci. No data set possessed 
more than the expected 5% of outliers given the test implemented 
in KDEtrees. Clearly further work is needed to understand the pros 
and cons of various tests of phylogenetic outliers. For the time 
being, we can note the robustness of various species tree methods 
to phylogenetic outliers. One attractive prospect of algorithms for 
species tree construction that use summary statistics, such as STAR 
and STEAC, is that these methods are powerful and fast, yet they 
appear less susceptible to error due to deviations of single genes 
from neutral expectations. These methods do not utilize all the 
information in the data and hence can be less efficient than Bayesian 
or likelihood methods [52], yet they perform well with moderate 
amounts of gene tree outliers due to processes like HGT.

3 Hypothesis Testing Using the Multispecies Coalescent Model


Hypothesis testing is a cornerstone of phylogenetic analysis but has 
received little attention in the context of the MSC (see ref. 154). 
Bayesian species tree inference [58, 59, 68-70] provides perhaps 
the most seamless approach to hypothesis testing. One can rela- 
tively easily assess the fit of the collected data to alternative tree 
topologies and compare the fit using Bayes factors or other 
approaches. One can also assess the fit of various models of analysis 
to the collected data [155]. Liu and Pearl [58] and Edwards et al. 
[124] used Bayes factors to determine whether concatenation or 
the MSC was a more appropriate model for several data sets; in all 
cases tested thus far, the MSC provides a far better fit to multilocus 
data (BF > 10) than does concatenation, in which all gene trees 
among loci are identical. Further work is needed to apply Bayes 
factors and likelihood ratio tests to multilocus data. 

The bootstrap, introduced to phylogenetics by Felsenstein 
[156], is the most common statistic applied to phylogenetic trees 
[157]. In the era of multilocus phylogenetics, the “multilocus 
bootstrap” of Seo [158] has been recommended as a more suitable 
approach to assessing confidence limits than the traditional boot- 
strap. In the traditional bootstrap, sites within a locus, or a series of 
concatenated loci, are resampled with replacement to create pseu- 
domatrices, which are then subjected to phylogenetic analysis, after 
which a majority rule consensus tree is usually made. By contrast, in 
the multilocus bootstrap, sites within loci and the loci themselves 
are resampled with replacement. In the context of the MSC, 
resampled pseudomatrices of the same number of loci as the origi- 
nal data set, which may contain duplicates of specific loci due to the 
random nature of the bootstrap, are then made into gene trees, 
from which a species tree can be made. The bootstrap and various 
other measures of branch-specific support [159] have been pro- 
posed as a means of assessing confidence in species trees made using 
the multilocus coalescent. Care should be taken in the comparison 
of different studies using different measures of support, since not all 
measures can be directly compared to one another. For example, as 
pointed out by Liu et al. [160], the measure of posterior support 
for ASTRAL trees proposed by Sayyari and Mirarab [159] is not the 
same as traditional bootstrap supports, and we do not yet know 
how they will scale under different conditions compared to the 
bootstrap. Edwards [161] summarized knowledge about the use 
of phylogenomic subsampling, in which data sets of increasing size 
or signal are analyzed so as to understand the stability and speed of 
approach to certainty of phylogenetic estimates under the MSC and 
under concatenation. He found that MSC methods tended to 
approach phylogenomic certainty more smoothly and monotonically
than do concatenation methods, which jump around erratically
in their certainty for sometimes conflicting topologies,
especially when sampling smaller numbers of genes. Although we
cannot simply translate many conclusions from the gene tree era of 
phylogenetics to the MSC era—for example, contrary to gene tree 
conclusions, it is not clear for MSC models that more taxa are 
always better than more loci [74]—many of these discussions 
about hypothesis testing echo early comparisons of posterior prob- 
abilities and bootstrap proportions used in the gene tree era of 
phylogenetics. 
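
The two-level resampling of the multilocus bootstrap described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of Seo [158]: the alignment representation (a dict mapping taxon names to equal-length aligned strings) and the function names `resample_sites` and `multilocus_bootstrap` are hypothetical and chosen only for this example. Gene tree and species tree estimation from each pseudo-data set is not shown.

```python
import random
from typing import Dict, List

Alignment = Dict[str, str]  # hypothetical representation: taxon -> aligned sequence


def resample_sites(locus: Alignment, rng: random.Random) -> Alignment:
    """Resample the columns (sites) of one locus with replacement."""
    n_sites = len(next(iter(locus.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in columns) for taxon, seq in locus.items()}


def multilocus_bootstrap(loci: List[Alignment], rng: random.Random) -> List[Alignment]:
    """Resample loci with replacement, then sites within each chosen locus."""
    chosen = [rng.choice(loci) for _ in loci]  # duplicates of a locus are allowed
    return [resample_sites(locus, rng) for locus in chosen]


# Toy data: two loci, three taxa (made up for illustration)
loci = [
    {"A": "ACGT", "B": "ACGA", "C": "TCGA"},
    {"A": "GGCTA", "B": "GGTTA", "C": "GACTA"},
]
replicate = multilocus_bootstrap(loci, random.Random(1))
print(len(replicate))  # one pseudo-locus per original locus
```

Each replicate keeps the number of loci of the original data set, as in the text; a traditional bootstrap would correspond to calling `resample_sites` alone on a single locus or concatenated matrix.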

The bootstrap has always provided a means of hypothesis test- 
ing that is very indirect with respect to comparing alternative 
phylogenetic hypotheses. Aside from the tests allowed by Bayesian 
approaches, there have been few discussions of testing of alternative 
phylogenetic trees in the era of the multispecies coalescent. In this 
regard, the pseudo-likelihood model provided by MP-EST [56] 
provides a convenient framework for hypothesis testing using spe- 
cies trees. This framework is not available in most other species tree 
methods, including ASTRAL, STAR, and STEAC, since these 
methods do not employ a likelihood model. MP-EST takes advan- 
tage of the likelihood model of Rannala and Yang [48] to assess the 
fit of a species tree to a collection of gene trees and can thus be used 
to compare alternative species tree topologies and branch lengths 
directly. 

To conduct a direct comparison of species trees using the
likelihood ratio test, we first compare the likelihoods of two trees
to find the most probable species tree that can explain the empirical
set of gene trees. The likelihood of a set of gene trees given a species
tree with branch lengths can be ascertained using functions in
Phybase [142]. Let Tree 1 be the null tree and Tree 2 be the
alternative tree. The likelihood ratio test statistic is

$$t = 2\,(L_{\mathrm{Tree2}} - L_{\mathrm{Tree1}}),$$

in which $L_{\mathrm{Tree1}}$ and $L_{\mathrm{Tree2}}$ are the
log-likelihoods of the null and alternative hypotheses, respectively. The
log-likelihood of the null hypothesis can be obtained from the
output of the program MP-EST by fitting the branch lengths and
topology of Tree 1 to the set of empirical gene trees. Similarly, we
can find the log-likelihood of the alternative tree Tree 2 using
MP-EST. The null distribution of the test statistic t is approximated
by a parametric bootstrap. Specifically, we generate 100 or more
bootstrap samples of gene trees under the null tree Tree 1. For each
sample of these bootstrapped trees, we calculate the log-likelihoods
of the null and alternative trees using the procedure described
above. The test statistics of the bootstrap samples then approximate
the null distribution of t. If the observed t for the null and
alternative species trees falls outside the distribution of the
bootstrap sample statistics, then the result can be considered
significant.
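
The procedure above can be sketched as follows. This is a minimal outline under stated assumptions, not the authors' pipeline: `fit_loglik` and `simulate_gene_trees` are hypothetical callables standing in for the MP-EST fit and for gene tree simulation under the null tree, and no real MP-EST or Phybase function is called. The one-sided upper-tail p-value is also an assumption of this sketch. A toy model, in which a "species tree" is just the probability of one gene tree topology, is included only so that the example runs.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def lrt_statistic(loglik_null: float, loglik_alt: float) -> float:
    """t = 2 * (L_Tree2 - L_Tree1), with Tree 1 as the null tree."""
    return 2.0 * (loglik_alt - loglik_null)


def parametric_bootstrap_lrt(
    gene_trees: Sequence,
    null_tree,
    alt_tree,
    fit_loglik: Callable,           # hypothetical: (species_tree, gene_trees) -> log-likelihood
    simulate_gene_trees: Callable,  # hypothetical: (species_tree, n, rng) -> list of gene trees
    n_boot: int = 100,
    seed: int = 0,
) -> Tuple[float, List[float], float]:
    rng = random.Random(seed)
    t_obs = lrt_statistic(fit_loglik(null_tree, gene_trees), fit_loglik(alt_tree, gene_trees))
    t_null = []
    for _ in range(n_boot):
        # Simulate under the null tree, then score both trees on the simulated sample
        sample = simulate_gene_trees(null_tree, len(gene_trees), rng)
        t_null.append(lrt_statistic(fit_loglik(null_tree, sample), fit_loglik(alt_tree, sample)))
    exceed = sum(1 for t in t_null if t >= t_obs)
    p_value = (exceed + 1) / (n_boot + 1)  # assumed one-sided, upper tail
    return t_obs, t_null, p_value


# Toy stand-ins (illustration only): a "tree" is the probability of topology "x"
def toy_loglik(p: float, data: Sequence[str]) -> float:
    return sum(math.log(p if g == "x" else 1.0 - p) for g in data)


def toy_simulate(p: float, n: int, rng: random.Random) -> List[str]:
    return ["x" if rng.random() < p else "y" for _ in range(n)]


toy_data = ["x"] * 18 + ["y"] * 2
t_obs, t_null, p_value = parametric_bootstrap_lrt(toy_data, 0.2, 0.8, toy_loglik, toy_simulate)
print(t_obs > 0, p_value < 0.05)
```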

[Figure 5. Top panels: Tree 1, Tree 2, Tree 3. Bottom panels: Tree 1 vs. Tree 2, Tree 1 vs. Tree 3, Tree 2 vs. Tree 3.]

Fig. 5 Example of hypothesis testing of alternative phylogenetic trees under the multispecies coalescent
model. Top: alternative phylogenetic hypotheses involving the rearrangement of major groups of Australo-
Papuan fairy wrens based on Lee et al. [162]. The three alternative phylogenetic trees are colored to indicate
the three major groups whose relationships are being tested. Bottom: results of the likelihood ratio test (LRT)
and estimates of confidence limits on the test statistic t using parametric bootstrapping. The plots show the
distributions of the test statistic t resulting from gene trees built from resampled, bootstrapped sequence data.
Despite the use of sequence data to generate the bootstrap gene tree distributions, the LRT is only an indirect
test of the signal in the sequence data and instead is best thought of as a test of the fit of the estimated gene
tree distribution on alternative phylogenies. See main text for further details

We applied this approach to assessing alternative phylogenetic
hypotheses to an example from birds (fairy wrens; [162]; Fig. 5).
This data set consists of 18 genes and 26 taxa, with loci coming
from a variety of marker types (exons, introns, anonymous loci).
Lee et al. [162] applied a number of MSC approaches to this data set
but did not compare alternative trees directly, having used only
bootstrap approaches. Here, we consider three species trees generated
from the rearrangement of the three major clades of wrens: the
core fairy wrens (Malurus), emu-wrens (Stipiturus), and grasswrens
(Amytornis; Fig. 5). Rearranging these major clades results in three
alternative rooted species trees. Based on traditional taxonomy and
because the gene trees in this data set were highly variable, even
among the three major clades, we consider these three alternative
hypotheses true alternatives and not “straw men.” Rooted maximum
likelihood gene trees were built from the alignments of each
locus using RAxML [163] and then used as input data for the
likelihood ratio test described above. The LRT was applied first to
Tree 1 (null) versus Tree 2 and then to Tree 1 versus Tree 3
and Tree 2 versus Tree 3. The results indicate that Tree 1 fits the
empirical gene trees significantly better than does Tree 2 or Tree 3
(p < 0.01), and there is no significant difference between
Trees 2 and 3 in their fit to the empirical gene trees (p = 0.52).
Thus, the LRTs strongly favor Tree 1 over both Tree 2 and Tree 3.

It is important to note that the LRT described above is not a
direct test of the phylogenetic signal in the DNA sequence data.
Rather, it is a test of the distribution of gene trees inferred from the
sequence data and assumes that the gene trees provided as data are
without error. It does indirectly test the signal in the sequence data,
because if the DNA sequences provide strong and consistent support
of the gene trees, then the bootstrapped set of gene trees will
be highly similar to one another, and the confidence limits on t will
be very tight. By contrast, if the DNA sequence data does not have
a strong signal, then the confidence limits on t will be very wide,
and it will be difficult to reject alternative species trees. The LRT
described here does not involve nested models. If the gene trees are
known without error, then the value of t itself can be used to assess
significance, assuming a chi-square distribution with 2 degrees of
freedom. Further research is needed on methods for comparing
and testing alternative species trees in the context of the MSC.
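
For the case in which t is referred directly to a chi-square distribution with 2 degrees of freedom, the upper-tail probability has the closed form exp(-t/2), so no statistical library is needed. The short sketch below only illustrates that arithmetic; the function name `chi2_df2_pvalue` and the treatment of negative t (returning 1) are assumptions of this example, and whether the chi-square approximation is appropriate for non-nested species trees is a separate question that the text leaves open.

```python
import math


def chi2_df2_pvalue(t: float) -> float:
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    if t <= 0.0:
        return 1.0  # assumed convention: the alternative tree fits no better than the null
    return math.exp(-t / 2.0)


# The familiar 5% critical value for 2 degrees of freedom is about 5.991
print(round(chi2_df2_pvalue(5.991), 3))  # 0.05
```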


Species tree methods are likely to continue to gain ascendancy as 
the strongest evidence of taxonomic relationship in phylogenetic 
research. As with any form of evidence, the conclusions of a species 
tree analysis are fallible, with each method susceptible to biases in 
the input data. For example, Xi et al. [164] showed that PhyML
[165] yields biased gene trees when there is little information in the
DNA sequences and can therefore result in biased species trees.
This issue is particularly problematic when using MP-EST v. 1.5,
which, unlike ASTRAL or MP-EST v. 2.0, does not randomly
resolve or appropriately accommodate gene trees with polytomies
or with zero-length or near-zero-length branches. This bias may have affected the
performance of MP-EST in previous side-by-side comparisons with 
ASTRAL. In the future, further work should be devoted to discov- 
ering and quantifying additional biases in inference of species trees. 
With the size of phylogenomic data sets increasing, even small 
biases can be amplified and result in poorly estimated species trees. 

Many in the field agree that the most appealing statistical 
models for species tree inference using the MSC include Bayesian 
and full-likelihood models [52]. But it is still clear, at least to 
empiricists, not only that “two-step” methods of species tree infer- 
ence work quite well in general but also that the large phyloge- 
nomic data sets available today prohibit the use of full-likelihood 
methods. Regardless, we now know that both types of models 
clearly outperform concatenation across wide swaths of parameter 
space, especially if one also evaluates the reliability of the confidence 
limits on the estimate of phylogeny and not only the point estimate 
of the topology. The major directions for future research in the field 
of species tree inference therefore include increasing the scalability 
of computational inference of species trees, further development of 
frameworks for hypothesis testing using the MSC, developing 
additional models of divergence with gene flow and network coa- 
lescent models (Fig. 4), and improvement in the estimation of gene 
trees and species trees from SNP data [166]. Linking mutations in 
species trees and heterogeneous gene trees to diverse phenotypic 
and ecological data will be another important avenue for the future 
[167, 168]. We view the MSC, with its application of population 
genetic models to higher-level systematics, as a key component of 
the long-term goal of uniting microevolution and macroevolution. 
Even if it proves incomplete in the long term, the neutral MSC 
provides a powerful null model for the understanding of genetic 
diversity across time and space. 


1. Consider the following discordant set of gene trees. {Gene
1 = (A:10,(B:8,C:8):2); Gene 2 = (B:9,(A:6,C:6):3); Gene 
3 = ((A:4,B:4):4,C:8)}. Assuming that these genes perfectly 
reflect the time of genetic divergence, and the only cause of 
discordance is incomplete lineage sorting or deep coalescence, 
what is the most likely species tree? Answer: ((A:4,B:4):2,C:6) 


2. Find the data set for 30 noncoding loci from 4 species of
Australian grass finches (3 Poephila, plus out-group Taeniopygia)
from Jennings and Edwards [169]. It can be found on the
web page for Liang Liu’s BEST program: http://faculty.franklin.uga.edu/lliu/content/BEST.
Use the Bayesian program
BEST [68] or BPP [70] and the nonparametric method in
STAR [65] to estimate the species tree for the four species,
using Taeniopygia as the out-group. Do you estimate the same
topology with both methods? What about the support for the 
single internal branch? If the support is not the same, what 
could be causing the difference? Answer: The BEST or BPP tree 
should have higher support than the STAR tree, but they both 
should have the same topology. The STAR tree might have lower 
support because in the data set about half of the gene trees have a 
topology differing from the species tree; whereas the full Bayesian 
model accommodates this variation accurately, nonparametric 
“two-step” methods interpret this type of gene tree variation as 
discordance, in conflict with the majority of the gene trees and 
with the species tree. 


3. For the above data set, make individual gene trees using RAxML
[170], and use the likelihood functions and bootstrap capabilities
of Phybase [142] to conduct a likelihood ratio test of the
two alternative species tree topologies for the four grass 
finches. Alternatively, you could use the posterior distribution 
of gene trees generated in BEST to estimate the confidence 
limits on the test statistic t. Is the tree estimated in question 
2 significantly better than alternative trees? Answer: The LRT 
indicates that the tree estimated in question 2 is significantly 
better than alternative trees. 
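
The answer to exercise 1 can be checked with a short script. The reasoning assumed here is that, when gene trees are known without error and discordance is due only to deep coalescence, two species cannot have diverged earlier than the most recent coalescence of their genes, so the minimum pairwise coalescence time across genes bounds each species divergence. The pairwise times below were read off the three gene trees by hand, and the function names are hypothetical.

```python
from itertools import combinations

# Pairwise coalescence times read off the three gene trees of exercise 1
gene_times = [
    {("A", "B"): 10, ("A", "C"): 10, ("B", "C"): 8},  # Gene 1 = (A:10,(B:8,C:8):2)
    {("A", "B"): 9, ("A", "C"): 6, ("B", "C"): 9},    # Gene 2 = (B:9,(A:6,C:6):3)
    {("A", "B"): 4, ("A", "C"): 8, ("B", "C"): 8},    # Gene 3 = ((A:4,B:4):4,C:8)
]


def minimum_times(times):
    """Minimum coalescence time across genes for every pair of species."""
    return {pair: min(g[pair] for g in times) for pair in times[0]}


def cluster_by_minimum(min_t):
    """Join the closest clusters first, using the smallest cross-cluster time."""
    taxa = sorted({t for pair in min_t for t in pair})
    clusters = [frozenset([t]) for t in taxa]
    merges = []

    def dist(c1, c2):
        return min(min_t[tuple(sorted((a, b)))] for a in c1 for b in c2)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((sorted(c1), sorted(c2), dist(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges


print(cluster_by_minimum(minimum_times(gene_times)))
# [(['A'], ['B'], 4), (['C'], ['A', 'B'], 6)]
```

The two merge heights, 4 for (A,B) and 6 for the root, correspond to the tree ((A:4,B:4):2,C:6) given in the answer.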
Abstract


Genome-wide comparison of phylogenetic trees is becoming an increasingly common approach in evolu- 
tionary genomics, and a variety of approaches for such comparison have been developed. In this article we 
present several methods for comparative analysis of large numbers of phylogenetic trees. To compare 
phylogenetic trees taking into account the bootstrap support for each internal branch, the boot-split 
distance (BSD) method is introduced as an extension of the previously developed split distance 
(SD) method for tree comparison. The BSD method implements the straightforward idea that comparison 
of phylogenetic trees can be made more robust by treating tree splits differentially depending on the 
bootstrap support. Approaches are also introduced for detecting treelike and netlike evolutionary trends in 
the phylogenetic Forest of Life (FOL), i.e., the entirety of the phylogenetic trees for conserved genes of 
prokaryotes. The principal method employed for this purpose includes mapping quartets of species onto 
trees to calculate the support of each quartet topology and so to quantify the tree and net contributions to 
the distances between species. We describe the methods used to analyze the FOL and the
results obtained with these methods. These results support the concept of the Tree of Life (TOL) as a 
central evolutionary trend in the FOL as opposed to the traditional view of the TOL as a “species tree.” 


Key words Forest of Life, Tree of Life, Phylogenomic methods, Tree comparison, Map of quartets

SD Split distance
TNT Tree-Net trend 
TOL Tree of Life 


1 Introduction 


With the advances of genomics, phylogenetics entered a new era
that is marked by the availability of extensive collections of
phylogenetic trees for thousands of individual genes. Examples of such tree
collections are the phylomes that encompass trees for all sufficiently
widespread genes in a given genome [1-4] or the “Forest of Life”
(FOL) that consists of all trees for widespread genes in a
representative set of organisms [5]. It has been known since the early days of
phylogenetics that trees built on the same set of species often have
different topologies, especially when the set includes distant
species, most notably, in prokaryotes [6, 7]. The availability of
“forests” consisting of numerous phylogenetic trees exacerbated the
problem as an enormous diversity of tree topologies was
revealed. The inconsistency between trees has several major
sources: (1) problems with ortholog identification caused primarily
by cryptic paralogy; (2) various artifacts of phylogenetic analysis,
such as long branch attraction (LBA); (3) horizontal gene transfer
(HGT); and (4) other evolutionary processes distorting the vertical,
treelike pattern such as incomplete lineage sorting and
hybridization [1, 8-10]. In order to obtain robust results in
genome-level phylogenetic analysis, for instance, to classify phylogenetic
trees into clusters with (partially) congruent topologies or to
identify common trends among multiple trees, reliable methods for
comparing trees are indispensable.

The number and diversity of tree comparison methods and 
software have substantially increased in the last few years. The tree 
comparison methods variously use tree bipartitions, such as parti- 
tion or symmetric difference metrics [11] and split distance [12]; 
distance between nodes such as the path length metrics [13], nodal 
distance [12, 14], and nodal distance for rooted trees [15]; com- 
parison of evolutionary units such as triplets and quartets [16]; 
subtransfer operations such as subtree transfer distance [17], 
nearest-neighbor interchanging [18], subtree prune and regraft 
(SPR) using a rooted reference tree [19], SPR for unrooted trees 
[20] and tree bisection and reconnection (TBR) [17], and matching
pair (MP) distance [21]; (dis)agreement methods such as agreement
subtrees [22], disagree [12], corresponding mapping [23],
and congruence index [24]; tree reconciliation [25]; and topological
and branch length methods such as K-tree score [26]. Several
algorithms have been proposed to analyze multi-family trees.
For example, the From Multiple to Single (FMTS) algorithm
systematically prunes each gene copy from a multi-family tree to
obtain all possible single-gene trees [12] and an algorithm imple- 
mented in TreeKO prunes nodes from the input rooted trees in 
which duplication and speciation events are labeled [27]. Another 
algorithm employs a variant of the classical Robinson-Foulds 
method to compare phylogenetic networks [28]. However, to the 
best of our knowledge, none of the available metrics for tree 
comparison takes into account the robustness of the branches, a 
feature that appears important to minimize the impact of artifacts 
(unreliable parts of a tree) on the outcome of comparative tree 
analysis. Here, we present the boot-split distance (BSD) method 
that calculates distances between phylogenetic trees with weighting 
based on bootstrap values. This method is implemented in the 
program TOPD/FMTS [12]. In our recent research, we used the 
BSD method combined with classical multidimensional scaling 
(CMDS) analysis to explore the main trends in the phylogenetic 
FOL and to explore the “Tree of Life” (TOL) concept in light of 
comparative genomics [5, 29]. 

Since the time (ca 1838) when Darwin drew the famous sketch 
of an evolutionary tree in his notebook on transmutation of species, 
with the legend “I think...,” the thinking on the “Tree of Life” 
(TOL) has evolved substantially. The first phylogenetic revolution, 
brought about by the pioneering work of Zuckerkandl and Pauling 
[30] and later Woese and coworkers [31], was the establishment of 
molecular sequences as the principal material for phylogenetic tree 
construction. The second revolution was triggered by the
advent of comparative genomics, when it was realized that
HGT, at least among prokaryotes, is much more common than
previously suspected. The first revolution was a triumph of the tree 
thinking, when a well-resolved TOL started to appear within reach. 
The second revolution undermines the very foundation of the TOL 
concept and threatens to destroy it altogether [32-34]. 

The current views of evolutionary biologists on the TOL span 
the entire range from acceptance to complete rejection, with a host 
of moderate positions. The following rough classification may be 
used to summarize these positions: (a) acceptance of the TOL as the
dominant trend in evolution: HGT is considered to be rare and 
overhyped, and most of the observed “transfers” are deemed to be 
artifacts [35-38]; (b) the TOL is the common history of the 
(nearly) nontransferable core of genes, surrounded by “vines” of 
HGT [39-50]; (c) each gene has its own evolutionary history 
blending HGT and vertical inheritance; a statistical trend might 
exist in the maze of gene histories, and it could even be treelike 
[5, 29, 51, 52]; and (d) ubiquity of HGT renders the TOL concept 
totally obsolete (prokaryotic species and higher taxa do not exist, 
and microbial “taxonomy” is created by a pattern of biased HGT) 
[32, 34, 53-58].

We found that, although different trends and patterns have to
be invoked to describe the FOL in its entirety, the main, most
robust trend is the “statistical TOL,” i.e., the signal of coherent
topology that is discernible in a large fraction of the trees in the
FOL, in particular, among the nearly universal trees (NUTs)
[59, 60].

We further explored the FOL by analysis of species quartets
[61]. A quartet is a group of four species, which is the minimum
evolutionary unit in unrooted phylogenetic trees; each quartet can
assume three unrooted tree topologies [16]. We described a quantitative
measure of the tree and net signals in evolution that is
derived from an analysis of all quartets of species in all trees of the
FOL. The results of this analysis indicate that, although diverse
routes of netlike evolution jointly dominate the FOL, the pattern of
treelike evolution that recapitulates the consensus topology of the
NUTs is the single most prominent, coherent trend. Here, we
report an extended version of these methodologies introduced to
analyze the FOL and its trends, as well as new concepts of prokaryotic
evolution under the FOL perspective (Fig. 1).

[Figure 1. Methods: boot-split distance (BSD); inconsistency score (IS); classical multidimensional scaling analysis (CMDS); map of quartet species. Concepts: Forest of Life (FOL); nearly universal trees (NUTs); central trend Tree of Life (TOL); patterns in the FOL; tree and net components of evolution. Also shown: the Tree-Net Trend (TNT).]
Fig. 1 A schematic of the methods and concepts involved in the FOL analysis


2 Materials


2.1 The Forest of Life (FOL) and Nearly Universal Trees (NUTs)


We analyzed the set of 6901 phylogenetic trees from [5] that were
obtained as follows. Clusters of orthologous genes were obtained
from the COG [62] and EggNOG [63] databases from 100 prokaryotic
species (59 bacteria and 41 archaea). The species were
selected to represent the taxonomic diversity of Archaea and Bacteria
(for the complete list of species, see Additional File 1). The BeTs
algorithm [62] was used to identify the orthologs with the highest
mean similarity to other members of the same cluster (“index
orthologs”), so the final clusters contained 100 or fewer genes,
with no more than one representative of each species. The
sequences in each cluster were aligned using the Muscle program
[64] with default parameters and refined using Gblocks [65]. The
program Multiphyl [66], which selects the best of 88 amino acid
substitution models, was used to reconstruct the maximum likelihood
tree of each cluster. The nearly universal trees (NUTs) are
defined as trees from COGs that are represented in more than 90%
of the species included in the study.


3 Methods


3.1 Boot-Split Distance: A Method to Compare Phylogenetic Trees Taking into Account Bootstrap Support


3.1.1 Boot-Split Distance (BSD)


The BSD method compares trees based on the original split
distance (SD) [12] method. Both methods work by collecting all
possible binary splits of the two compared trees and calculating
the fraction of equal splits, i.e., those splits that are present in both
trees (different splits refer to splits that are present in only one of
the two trees). Instead of considering all branches as being equal as
is the case in SD, the BSD method takes into account the bootstrap
values to increase or decrease the SD value proportionally to the
robustness of individual internal branches. The BSD value is the
average of the BSD in the equal splits (eBSD) and the BSD in the
different splits (dBSD) (Eq. 1). Equations 2 and 3 give the formulas to
calculate the eBSD and dBSD values, respectively.

$$\mathrm{BSD} = \frac{\mathrm{eBSD} + \mathrm{dBSD}}{2} \qquad (1)$$

$$\mathrm{eBSD} = 1 - \left[\frac{e}{a} \times M_e\right] \qquad (2)$$

$$\mathrm{dBSD} = \frac{d}{a} \times M_d \qquad (3)$$

Here $e$ is the sum of bootstrap values of equal splits, $d$ is the sum
of bootstrap values of different splits, $a$ is the sum of all bootstrap
values, $M_e$ is the mean bootstrap value of equal splits, and $M_d$ is the
mean bootstrap value of different splits.
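
As a numerical illustration of Eqs. 1-3, the sketch below computes the BSD from two lists of bootstrap values on the 0-1 scale. It is not the TOPD/FMTS implementation: the function name `boot_split_distance` is hypothetical, and the way bootstrap values of a split shared by both trees are pooled into the "equal" list is left to the caller, since that detail is not specified here.

```python
from statistics import mean
from typing import Sequence


def boot_split_distance(equal: Sequence[float], different: Sequence[float]) -> float:
    """BSD from bootstrap values (0-1 scale) of equal and different splits (Eqs. 1-3)."""
    e = sum(equal)       # sum of bootstrap values of equal splits
    d = sum(different)   # sum of bootstrap values of different splits
    a = e + d            # sum of all bootstrap values
    if a == 0:
        raise ValueError("at least one split with nonzero bootstrap support is required")
    m_e = mean(equal) if equal else 0.0
    m_d = mean(different) if different else 0.0
    e_bsd = 1.0 - (e / a) * m_e   # Eq. 2
    d_bsd = (d / a) * m_d         # Eq. 3
    return (e_bsd + d_bsd) / 2.0  # Eq. 1


print(boot_split_distance([1.0, 1.0], []))  # identical, fully supported trees: 0.0
print(boot_split_distance([], [1.0, 1.0]))  # no shared splits, full support: 1.0
```

With these definitions, well-supported shared splits pull the distance toward 0 and well-supported conflicting splits push it toward 1, while weakly supported splits of either kind contribute little.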

3.1.2 The BSD Algorithm


The BSD algorithm proceeds in four basic steps to compare
pairs of trees (Fig. 2). The first step is to obtain all possible splits
from both trees. This procedure implies a binary split of the tree at
each internal branch, so that the tree is partitioned into two parts
each of which contains at least two species. In the second step, the common set
of leaves between the two trees is obtained, that is, the set of shared
species. Only trees with a common leaf set of at least four species
can be compared. The third step consists of pruning all splits to the
common leaf set of species; at this step, species that are present in
only one of the two compared trees are removed from the split list.
After this procedure, in partially overlapping trees, the algorithm
checks whether each of the splits remains a valid partition, that is, a
partition that separates at least two species from the rest of the tree.
If a split is not a valid partition, it is removed. Finally, the algorithm
calculates the BSD using Eqs. 1-3.


[Figure 2. Recoverable box labels: “All splits from tree 1 pruned to common species” and “All splits from tree 2 pruned to common species.”]
Fig. 2 The main algorithm of the BSD method. The algorithm to calculate the
BSD between two trees includes four basic steps: (1) split both trees in all
possible partitions, (2) read the common set of species of both trees, (3) prune
the splits according to the common leaf set, and (4) calculate the BSD
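
The first three steps of the algorithm can be sketched as follows. This is a topology-only illustration, not the TOPD/FMTS code: trees are represented as nested tuples of leaf names (an assumption made for this example), bootstrap values are ignored, and all function names are hypothetical. The bootstrap weighting of step 4 would be applied to the resulting sets of equal and different splits.

```python
from typing import FrozenSet, Set, Tuple, Union

Tree = Union[str, tuple]            # a leaf name, or a tuple of subtrees
Split = FrozenSet[FrozenSet[str]]   # an unordered pair of leaf sets


def leaf_set(tree: Tree) -> FrozenSet[str]:
    if isinstance(tree, str):
        return frozenset([tree])
    leaves: FrozenSet[str] = frozenset()
    for child in tree:
        leaves |= leaf_set(child)
    return leaves


def all_splits(tree: Tree) -> Set[Split]:
    """Step 1: one bipartition per internal branch, each side with >= 2 species."""
    everything = leaf_set(tree)
    found: Set[Split] = set()

    def visit(node: Tree) -> None:
        if isinstance(node, str):
            return
        side = leaf_set(node)
        other = everything - side
        if len(side) >= 2 and len(other) >= 2:
            found.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found


def prune_splits(splits: Set[Split], common: FrozenSet[str]) -> Set[Split]:
    """Step 3: restrict splits to shared species and drop invalid partitions."""
    kept: Set[Split] = set()
    for split in splits:
        pruned = [side & common for side in split]
        if all(len(side) >= 2 for side in pruned):
            kept.add(frozenset(pruned))
    return kept


def compare_splits(tree1: Tree, tree2: Tree) -> Tuple[Set[Split], Set[Split]]:
    common = leaf_set(tree1) & leaf_set(tree2)  # step 2: shared species
    if len(common) < 4:
        raise ValueError("trees must share at least four species")
    s1 = prune_splits(all_splits(tree1), common)
    s2 = prune_splits(all_splits(tree2), common)
    return s1 & s2, s1 ^ s2  # equal splits, different splits


equal, different = compare_splits(((("A", "B"), "E"), ("C", "D")), (("A", "B"), ("C", "D")))
print(len(equal), len(different))  # 1 0
```

In the usage example, species E occurs only in the first tree; after pruning, both of its splits collapse to AB|CD, which is also the only split of the second tree.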


There are three possible types of comparisons for trees that do not
include paralogs, that is, include one and only one sequence from
each of the constituent species (Fig. 3). In the first case, the two
trees completely overlap, that is, consist of the same set of species
(Fig. 3a). In this case, step 3, the pruning procedure, is not necessary,
and the comparison involves only obtaining all possible splits
and the calculation of the BSD. In the second case, one of the
compared trees is a subset of the other tree (Fig. 3b). In this case,
the splits are only pruned and occasionally removed from the bigger
tree. In the third case, the two trees partially overlap. In both the
second and the third case, a pruning procedure is required. In
the example shown in Fig. 4, after the pruning procedure (step 3),
there is only one remaining split (split: AB|CD) that is repeated
several times in both trees. The remaining AB|CD split in Tree 1 is
separated by four nodes that have different bootstrap values. In this
case, the bootstrap of the remaining split is calculated using Eq. 4,
where $n$ is the total number of nodes between the two sides of the
split and $BS_i$ is the bootstrap value (adjusted to the 0-1 range) of
node $i$.

$$\mathrm{Bootstrap} = 1 - \prod_{i=1}^{n} (1 - BS_i) \qquad (4)$$

The bootstrap value associated with a particular branch of a
binary tree is taken as a measure of the probability that the four
subtrees on the opposite ends of this branch are partitioned correctly.
To estimate the probability of the correct partitioning of an
arbitrary set of four subtrees, the internal branch of the quartet tree
is mapped onto each of the internal branches of the original tree.
The quartet is considered to be resolved correctly if it is resolved
correctly relative to any of these branches. Under the assumption
that bootstrap probabilities on individual branches are independent,
Eq. 4 is obtained as the estimate of the bootstrap probability
for the internal branch of the quartet tree.


Fig. 3 Examples of the BSD algorithm in single family trees. (a) Two trees of the
same size. (b) Tree 1 is a subtree of Tree 2. (c) Two trees that partially overlap.
SD split distance, BSD boot-split distance, eBSD BSD of equal splits, dBSD BSD
of different splits, p number of equal splits, g number of different splits, m total
number of splits, a sum of bootstraps in all splits, e sum of bootstraps in equal
splits, d sum of bootstraps in different splits, M mean bootstrap value, M_e mean
bootstrap value in equal splits, M_d mean bootstrap value in different splits


3.1.3 Using a Bootstrap Threshold: Pros and Cons


The key question regarding the BSD method is which approach to 
phylogenetic tree comparison is best: using all branches, reliable or 
not, with the appropriate weighting, or using only branches supported 
by high bootstrap values. The first option 
is illustrated in Fig. 3, whereas Fig. 5 shows an example of a tree 
comparison that employs a bootstrap threshold of 70, i.e., only 
branches supported by a higher bootstrap are taken into account in 
the comparison. The second procedure appears reasonable and can 
be recommended in some cases. However, it is not advisable as a 
general approach because, when two large trees with varying 
bootstrap values are compared, using a strict threshold restricts the 
comparison to a small subset of robust branches, resulting in an 
artificially low BSD value. In other words, this procedure artificially 
inflates the similarity between the two trees by disregarding a large 
fraction of the branches. In addition, before considering the use of 
only the best-supported branches, one should take into account that 
the BSD method already uses bootstrap values to adjust the distance 
between trees, so if two trees are topologically similar (low 
SD) but supported by low bootstrap values, the distance increases 
(higher BSD), which is one of the advantages of the BSD method 
(see Eqs. 2 and 3). 


Fig. 4 Calculation of BSD for trees with unequal numbers of species. The 
larger tree (1) is pruned prior to the calculation of BSD. The bootstrap value for 
the only shared internal branch is calculated according to Eq. 4: 
Bootstrap = 1 - (0.1 × 0.9 × 0.9 × 0.9) = 0.93 


Fig. 5 Example of the BSD algorithm using a bootstrap cutoff. The figure shows 
the comparison of two phylogenetic trees that takes into account only those 
branches with bootstrap support greater than 70. SD split distance, BSD boot-split 
distance, eBSD BSD of equal splits, dBSD BSD of different splits, p number 
of equal splits, q number of different splits, m total number of splits, a sum of 
bootstraps in all splits, e sum of bootstraps in equal splits, d sum of bootstraps in 
different splits, Ma mean bootstrap value, Me mean bootstrap value in equal 
splits, Md mean bootstrap value in different splits 


3.1.4 Testing the BSD Method 


The performance of the BSD method was compared with that of 
the original SD method implemented in the TOPD/FMTS program 
[12]. Figure 6 shows the correlation of SD and BSD for trees 
with 4 to 15 species (a) and with 16 to 100 species (b) from a 
recent large-scale analysis of the FOL [5]. The 
three-way comparison of SD, BSD, and tree size (number of species) 
shows a positive correlation between SD and BSD for all tree 
sizes (R² = 0.8613 for trees with 4-15 species and R² = 0.7055 for 
trees with 16-100 species) (Fig. 6c). However, the SD follows a 
discrete distribution, which is most conspicuous in the 
comparisons of small trees (Fig. 6a), whereas, thanks to the use of 
the bootstrap values, the BSD distribution is continuous (Fig. 7). 

Figure 7 shows an example of the comparison (all-against-all) 
of three trees with six species each that differ in one, two, and three 
splits, resulting in SD values of 0.33, 0.67, and 1, respectively 
(Fig. 7a). Also, each tree was compared to itself, resulting in an SD 
of 0. Then, bootstrap values were assigned randomly to the trees in 
order to compare the trees using the BSD method, and this procedure 
was repeated 1000 times. The resulting plot (Fig. 7b) shows 
that, for the comparison of trees with SD of 0 and 1, the BSD values 
ranged from 0 to 0.5 and from 0.5 to 1, respectively, and in 
principle, could assume all intermediate values. In the case of the 
comparisons that differed in one split (SD = 0.33), the BSD value 
was greater than 0.33 in 75% of the comparisons, whereas for the 
comparisons that differed in two splits (SD = 0.67), 25% of the 
BSD values were greater than 0.67. Thus, the BSD method for tree 
comparison offers a better resolution than the SD method, 
especially for trees with a small number of species. 
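The SD values quoted for the six-species trees can be reproduced with a small sketch. It assumes that SD is the fraction of informative splits not shared by the two trees, which is consistent with the values 0.33, 0.67, and 1 above; the data layout (each tree as a list of clades, i.e., the species on one side of each internal branch) and the function names are illustrative, not the TOPD/FMTS implementation. Bootstrap weighting is deliberately left out.

```python
def informative_splits(clades, leaves):
    """Canonical bipartitions of `leaves` induced by `clades`.

    Each clade is the set of species on one side of an internal branch;
    species outside `leaves` are pruned away.
    """
    leaves = frozenset(leaves)
    splits = set()
    for clade in clades:
        side = frozenset(clade) & leaves
        other = leaves - side
        # A split with fewer than two species on a side carries no
        # topological information once the tree is pruned.
        if len(side) >= 2 and len(other) >= 2:
            splits.add(frozenset((side, other)))
    return splits


def split_distance(clades1, leaves1, clades2, leaves2):
    """Fraction of informative splits that are not shared by the two trees."""
    common = set(leaves1) & set(leaves2)
    s1 = informative_splits(clades1, common)
    s2 = informative_splits(clades2, common)
    total = len(s1) + len(s2)
    return len(s1 ^ s2) / total if total else 0.0


six = "ABCDEF"
t0 = [{"A", "B"}, {"C", "D"}, {"E", "F"}]
t1 = [{"A", "B"}, {"A", "B", "C"}, {"E", "F"}]  # one split differs from t0
t2 = [{"A", "B"}, {"A", "B", "C"}, {"D", "F"}]  # two splits differ
t3 = [{"A", "C"}, {"A", "C", "E"}, {"B", "D"}]  # all three differ

for other in (t0, t1, t2, t3):
    print(round(split_distance(t0, six, other, six), 2))
```

Because the result can only take the values k/3 for six species, this sketch also shows why SD is discrete for small trees. Note that collapsing repeated splits into a set discards the information that Eq. 4 uses when a pruned split spans several nodes.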

Figure 8a shows the results of the analysis of six simulated alignments 
with an increasing level of noise (divergence with respect to the 
initial alignment) in each alignment, i.e., from alignment 
0 (without noise and producing trees with bootstrap values of 
100) to alignment 5 with the maximum level of noise. For each 
alignment, a tree was constructed using the UPGMA method from 
the web server DendroUPGMA (http://genomes.urv.cat/UPGMA). 
Distances were calculated using the Jaccard coefficient, 
and bootstraps were generated from 100 replicates. The results of 
the tree comparison (Fig. 8b) using three different methods, 
namely, nodal distance (ND), SD, and BSD, show that the BSD 
method presents a continuous distribution resulting in a better 
resolution of the distances than the other two methods. Indeed, 
the SD and ND methods fail to discern the similarity between trees 
after six changes, whereas the BSD method still reports discernible 
similarity (Fig. 8b). In order to compare the three tree comparison 
methods, the distance reported by each method was normalized to 
the maximum value in each case, i.e., after 46 changes (maximum 
number of changes in the simulation), the distance to the initial tree 
is 1.41, 0.30, and 0.42 for ND, SD, and BSD, respectively. All three 
distance values indicate that the trees are similar far above the 
random expectation, supporting the robustness of all methods, 
but the BSD method presents a better resolution in the tree 
comparison. 


Fig. 6 Correlation of BSD and SD from the all-against-all tree comparisons of 6901 phylogenetic trees. (a) 
Trees containing 4-15 species. (b) Trees containing 16-100 species. (c) SD, BSD, and tree size for trees 
containing between 16 and 100 species 


Fig. 7 Comparisons of trees with six taxa. Bootstrap values were assigned randomly in each comparison 


3.1.5 Analysis of Random Trees and the Significance of BSD Results 


To assess the significance of the tree comparison by the BSD 
method, we performed several tree comparisons using random 
trees containing between 4 and 100 species (Fig. 9). Each test is 
an all-against-all comparison of 1000 random trees (for complete 
results see Additional File 2). The results from random tree 
comparison have to be used to determine whether the detected 
similarities or differences between trees are significantly different 
from chance [12]. Figure 9 shows that the distance between random 
trees monotonically increases with the tree size up to a value of 
approximately 0.75 for BSD and approximately 0.999 for SD. In 
other words, although BSD is an extension of the SD method, the 
results obtained by the two methods are not directly comparable. 
Therefore, to assess whether the similarity between two trees is 
better than chance, one must consider the method used for the 
tree comparison (e.g., SD or BSD) and the size of the tree. For 
example, consider two trees with 15 species each for which the SD 
method reports a distance of 0.75. This value is far below the random 
expectation (Fig. 9), so the conclusion would be that the two trees are 
nonrandomly similar. However, if the same distance value (0.75) is 
reported by the BSD method, the conclusion would be the 
opposite, namely, that the two trees are no more similar than two 
random trees of 15 species. 


Fig. 8 Comparison of six trees constructed from alignments with increasing noise levels. (a) Comparison of 
trees from six simulated alignments. The UPGMA tree from each alignment was reconstructed with the web 
server DendroUPGMA (http://genomes.urv.cat/UPGMA) using the Jaccard coefficient as the 
measure of distance and generating 100 bootstrap replicates. Alignment 0 corresponds to the initial 
alignment without noise that perfectly separates all branches, resulting in a tree with bootstrap values of 
100 for all internal nodes. Alignments 1 to 5 correspond to the derivatives of the initial alignment with 
increasing noise levels at each step. (b) Results of the comparison of each tree (1 to 5) with the initial tree (0). 
The trees were compared using three methods: split distance (SD), nodal distance (ND), and boot-split 
distance (BSD). For the purpose of comparison, the results obtained with each of the three methods were 
normalized to the maximum value in each case 


Fig. 9 Random BSD and SD depending on the tree size. Results of the tree comparison of random trees (with 
different sizes ranging from 4 to 100 species) show that the BSD and SD increase up to 0.75 and 0.999, 
respectively 
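The point that the same distance value can be significant under one method and not under the other can be illustrated with a rough sketch of an empirical significance test. The function `empirical_p_value` and both null samples are hypothetical placeholders: in practice the null distances would come from all-against-all comparisons of random trees of the relevant size, computed with the same method as the observed distance.

```python
def empirical_p_value(observed, random_distances):
    """Share of random-tree distances that are as small as, or smaller
    than, the observed distance (small value = closer than chance)."""
    hits = sum(1 for d in random_distances if d <= observed)
    # The +1 terms keep the estimate away from an exact zero.
    return (hits + 1) / (len(random_distances) + 1)


# Placeholder null samples, NOT real data: SD between random trees is
# assumed to sit near 1, BSD between random trees near 0.75.
sd_null = [0.90 + 0.0001 * i for i in range(1000)]
bsd_null = [0.65 + 0.0002 * i for i in range(1000)]

observed = 0.75
p_sd = empirical_p_value(observed, sd_null)
p_bsd = empirical_p_value(observed, bsd_null)
print(p_sd < 0.05, p_bsd < 0.05)
```

Under these assumed nulls, a distance of 0.75 is far in the lower tail for SD but sits in the middle of the distribution for BSD, mirroring the 15-species example in the text.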

Another and probably the most important problem in the 
comparison of phylogenetic trees is how to interpret the results 
from a biological perspective. To address this issue, we generated 
random trees containing from 4 to 100 species and performed 1 to 
100 permutations (swaps of a pair of branches) in each tree. The 
resulting tree was then compared with the source tree (Fig. 10a, b). 
The results show the number of permutations required to obtain a 
particular BSD value for different tree sizes (number of species). 
For instance, BSD = 0.3 in the comparison of two trees with 
20 species indicates that the two trees are separated by one 
permutation, whereas BSD = 0.6 indicates that the trees are separated by 
approximately 9 permutations (for the complete listing of 
equivalences between BSD, SD, and the number of permutations, see 
Additional File 3). Considering that each permutation corresponds 
to an HGT event, the BSD may be construed as the measure of the 
extent of HGT contributing to the topological difference between 
the compared trees. Given the discrete distribution of SD values, 
the SD cannot be used to infer the number of permutations 
with the same precision as BSD. 


Fig. 10 The number of permutations and the BSD. (a) BSD depending on the 
number of permutations and tree size. (b) Mean and standard deviation of the 
BSD for up to 100 permutations for trees with 20 species 


3.2 Analysis of Topological Trends in a Set of Phylogenetic Trees 


3.2.1 Calculation of the Tree Inconsistency 


A key characteristic of the FOL is the degree of the topological 
(in)consistency between the constituent trees. To quantify this trend, 
we introduced the inconsistency score (IS), which is the fraction of 
the times that the splits from a given tree are found in all N trees 
that comprise the FOL. Thus, the IS may be naturally taken as a 
measure of how representative the topology of the given tree is of 
the entire FOL. The IS is calculated using Eqs. 5-7, where N is the 
total number of trees, X is the number of splits in the given tree, 
and Y is the number of times the splits from the given tree are found 
in all trees of the FOL. 


$$IS = \frac{\frac{1}{Y} - IS_{\min}}{IS_{\max} - IS_{\min}} \qquad (5)$$

$$IS_{\min} = \frac{1}{X \cdot N} \qquad (6)$$

$$IS_{\max} = \frac{1}{X} \qquad (7)$$
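The behavior of the score at its two extremes can be checked with a short sketch. It assumes the normalization IS = (1/Y - IS_min)/(IS_max - IS_min) with IS_min = 1/(X·N) and IS_max = 1/X, and that Y counts the given tree itself, so that X ≤ Y ≤ X·N; the function name and the example numbers (97 splits, 102 trees) are illustrative only.

```python
def inconsistency_score(x_splits, y_found, n_trees):
    """Inconsistency score of one tree under the assumed Eqs. 5-7.

    x_splits: number of splits in the given tree (X)
    y_found:  number of times those splits are found across all trees (Y)
    n_trees:  total number of trees (N)
    """
    if x_splits <= 0 or n_trees <= 1:
        raise ValueError("need at least one split and two trees")
    if not x_splits <= y_found <= x_splits * n_trees:
        raise ValueError("Y must lie between X and X*N")
    is_min = 1.0 / (x_splits * n_trees)  # every split found in every tree
    is_max = 1.0 / x_splits              # splits found only in the tree itself
    return (1.0 / y_found - is_min) / (is_max - is_min)


fully_consistent = inconsistency_score(97, 97 * 102, 102)
fully_inconsistent = inconsistency_score(97, 97, 102)
print(fully_consistent, fully_inconsistent)
```

With this normalization, a tree whose splits all recur in every tree scores 0 and a tree whose splits occur nowhere else scores 1, so the score can be read as a percentage, as in the ranges reported later for the NUTs.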


In addition to the calculation of a single value of IS for a given 
tree by comparing its topology to the topologies of the rest of the 
trees in the FOL, IS can be calculated along the depth of the trees, 
using either split depth or phylogenetic depth. The split depth was 
calculated for each unrooted tree according to the number of splits 
from the tips to the center of the tree. The value of split depth 
ranged from 1 to 49 ([100 species/2] - 1). The phylogenetic depth was 
obtained from the branch lengths of a rescaled ultrametric tree, 
rooted between archaeal and bacterial species, and ranged from 0 to 
1. The topology of the ultrametric tree was obtained from the 
supertree of the 102 NUTs using the CLANN program 
[67]. The branch lengths from each of the 6901 trees were used 
to calculate the average distance between each pair of species. The 
obtained matrix was used to calculate the branch lengths of the 
supertree of the NUTs. This supertree with branch lengths was 
then used to construct an ultrametric tree using the program 
KITSCH from the Phylip package [68] and rescaled to the depth 
range from 0 to 1. The resulting ultrametric tree was used for the 
analysis of the dependence of tree inconsistency on phylogenetic 
depth. 


3.2.2 Classical Multidimensional Scaling Analysis 


The classical multidimensional scaling (CMDS), also known as 
principal coordinate analysis, is the multifactorial method best 
suited to analyze matrices obtained from tree comparison methods 
like BSD and identify the main trends in a large set of phylogenetic 
trees. The CMDS embeds n data points implied by an [n × n] 
distance matrix into an m-dimensional space (m < n) such that, 
for any k ∈ [1, m], the embedding into the first k dimensions is the 
best in terms of preserving the original distances between the points 
[69, 70]. In our analysis, the data points are trees, and the distances 
between them are obtained using the BSD method. The choice of the 
optimal number of clusters is made using the gap statistics algorithm 
[71]. The number of clusters k for which the value of the gap function 
for k + 1 clusters is not significantly higher than that for k clusters 
(z-score below 1.96, corresponding to the 0.05 significance level) is 
considered optimal. The CMDS analysis was performed using the K-means 
function of the R package that implements the K-means algorithm. 
The CMDS approach has been previously employed by Hillis et al. 
for phylogenetic tree comparison, with the distances between trees 
calculated using the Robinson-Foulds distance [72]. 


3.3 Analysis of Quartets of Species 


3.3.1 Definition of Quartets and Mapping Quartets onto Trees 


The minimum evolutionary unit in unrooted phylogenetic trees is 
defined by groups of four species (or quartets), and each quartet 
may be represented by one of three possible unrooted tree 
topologies (Fig. 11a). A quartet defined by the set of species A, B, C, and 
D has three possible unrooted topologies: (1) AB|CD, (2) AC|BD, 
and (3) AD|BC. To analyze which quartet topology (QT) best 
represents the relationships among the four species in a quartet, 
each quartet was compared against the entire set of phylogenetic 
trees from 100 species (the FOL). 

For 100 species, there are 3,921,225 quartets and, accordingly, 
11,763,675 topologies (Fig. 11b). A mapping of quartets onto 
trees is produced using the SD method [12]. A binary version of 
this method was employed to compare quartets and trees (a quartet 
is represented in a tree when SD = 0 and not represented when 
SD > 0). Figure 12a shows an example of quartet mapping onto a 
set of ten trees. Here q1 is a resolved quartet, with one topology 
supported by eight of the ten trees. By contrast, for q2, the three 
quartet topologies are equally supported, i.e., the topology of this 
quartet remains unresolved. 
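The binary mapping described above can be sketched as follows. This is an illustration rather than the SD-based implementation used in the study: it assumes each tree is given as a list of clades (the species on one side of each internal branch) plus its full species set, and the function names and the ten toy trees are made up for the example.

```python
from collections import Counter


def quartet_topology(clades, leaves, quartet):
    """Topology of `quartet` displayed by a tree, or None if not displayed."""
    quartet = frozenset(quartet)
    if len(quartet) != 4 or not quartet <= set(leaves):
        return None  # the tree does not contain all four species
    for clade in clades:
        inside = quartet & frozenset(clade)
        if len(inside) == 2:
            # This branch separates two quartet members from the other two.
            pairs = sorted(",".join(sorted(p)) for p in (inside, quartet - inside))
            return "|".join(pairs)
    return None  # quartet unresolved in this (possibly non-binary) tree


def quartet_support(trees, quartet):
    """Count how many trees display each topology of the quartet."""
    counts = Counter()
    for clades, leaves in trees:
        topology = quartet_topology(clades, leaves, quartet)
        if topology is not None:
            counts[topology] += 1
    return counts


leaves = "ABCDE"
ab_cd = ([{"A", "B"}, {"C", "D"}], leaves)
ac_bd = ([{"A", "C"}, {"B", "D"}], leaves)
ad_bc = ([{"A", "D"}, {"B", "C"}], leaves)
trees = [ab_cd] * 8 + [ac_bd, ad_bc]
support = quartet_support(trees, "ABCD")
print(support.most_common())
```

In this toy set, one topology is displayed by eight of the ten trees, which corresponds to the resolved case; a quartet whose three topologies receive similar counts would be treated as unresolved.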

To analyze which of the three possible topologies best represents 
the almost four million quartets in the FOL, each quartet 
topology was compared with the entire set of 6901 trees, resulting 
in a total number of 8.12 × 10^10 tree comparisons (Fig. 11b), and 
the number of trees that support each quartet topology was 
counted for the entire FOL or for the set of 102 NUTs (Fig. 11b). 
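The counts quoted above follow from elementary combinatorics, as this short check shows; the variable names are illustrative.

```python
from math import comb

n_species = 100
n_trees = 6901

n_quartets = comb(n_species, 4)           # ways to choose four species
n_topologies = 3 * n_quartets             # three unrooted topologies each
n_comparisons = n_topologies * n_trees    # every topology against every tree

print(n_quartets, n_topologies, f"{n_comparisons:.2e}")
```

The number of quartets grows with the fourth power of the number of species, which is why the quartet analysis is by far the most expensive step even for 100 species.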


Fig. 11 Quartets and quartet topologies. (a) Each quartet (q) is defined by a set of 
four species (different colors denote species) and may be represented by three 
possible unrooted tree topologies (qt). (b) Quartet topologies (QT). In 
100 species, the total number of quartets (Q) is 3,921,225. Each quartet may 
be represented by three distinct QTs, resulting in a total of 11,763,675 QTs. Each 
QT was mapped onto the FOL, i.e., for each QT, it was determined which of the 
three topologies is represented in each phylogenetic tree in the FOL 
(8.12 × 10^10 comparisons). Modified from ref. 61 


3.3.2 Distance Matrices and Heat Maps 


Using the quartet support values for each quartet, a 100 × 100 
between-species distance matrix was calculated as $d_{ij} = 1 - S_{ij}/Q_{ij}$, 
where $d_{ij}$ is the distance between two species, $S_{ij}$ is the number of 
trees containing quartets in which the two species are neighbors, 
and $Q_{ij}$ is the total number of quartets containing the given two 
species. Then, this distance matrix was used to construct different 
heat maps using the matrix2png web server ([73], Fig. 12b). In 
contrast to the BSD method, which is best suited for the analysis of 
the evolution of individual genes, the distance matrices derived 
from maps of quartets are used to analyze the evolution of species 
and to disambiguate treelike evolutionary relationships and 
"highways" (preferential routes) of HGT. 


Fig. 12 Mapping quartets. (a) Mapping quartets onto a set of ten trees. (b) A 
schematic of the procedure used to reconstruct a species matrix from the map of 
quartets 


3.3.3 The Tree-Net Trend (TNT) 


The quartet-based between-species distances were used to calculate 
the Tree-Net Trend (TNT) score. The TNT score is calculated by 
rescaling each matrix of quartet distances to a 0-1 scale between the 
supertree-derived matrix (which is taken to represent solely the 
treelike evolution signal, hence the distance of 0) and the matrix 
obtained from permuted trees, with distance values around the 
random expectation of 0.67 (Fig. 13). Two situations may occur 
in the calculation of the TNT score depending on the relationship 
between the distance in the supertree matrix (Ds) and the distance 
in the random matrix (Dr = 0.67). When Ds > Dr (e.g., in 
comparisons of archaea versus bacteria), 


$S_{TNT} = (d - Dr)/(Ds - Dr)$, 


where $S_{TNT}$ is the TNT score and d is the distance 
between the two compared species in the matrix. When Ds < Dr 
(in comparisons between closely related species), 


$S_{TNT} = 1 - ((d - Ds)/(Dr - Ds))$. 
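The quartet-based distance and the TNT rescaling can be written out as a minimal sketch. The function names and the example values are hypothetical; the formulas assume the distance 1 - S/Q for a species pair and the two-case rescaling between the supertree distance Ds and the random expectation Dr = 0.67 described in this section.

```python
def quartet_distance(s_ij, q_ij):
    """Between-species distance from quartet support: d = 1 - S/Q."""
    if q_ij <= 0:
        raise ValueError("Q must be positive")
    return 1.0 - s_ij / q_ij


def tnt_score(d, d_supertree, d_random=0.67):
    """Rescale a distance so that the supertree value maps to 1 (treelike)
    and the random expectation maps to 0 (netlike)."""
    if d_supertree == d_random:
        raise ValueError("Ds equal to Dr leaves the score undefined")
    if d_supertree > d_random:
        return (d - d_random) / (d_supertree - d_random)
    return 1.0 - (d - d_supertree) / (d_random - d_supertree)


d = quartet_distance(75, 100)            # hypothetical counts, d = 0.25
print(round(tnt_score(d, d_supertree=0.10), 2))
```

Expanding the second expression shows that both cases reduce to the same linear map, (d - Dr)/(Ds - Dr), so the case distinction affects only how the formula is written, not the value obtained.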


4 Phylogenetic Concepts in Light of Pervasive Horizontal Gene Transfer 


4.1 Patterns in the Phylogenetic Forest of Life 


The reconstruction of the evolutionary trends in the FOL is based 
on the idea that prokaryotes, effectively, share a common gene 
pool. This gene pool consists of genes with widely different ranges 
of phyletic spread, from universal genes to rare ones present in only 
a few species [74]. Thus, genes, as the elements of this gene pool, have 
their distinct evolutionary histories blending HGT and vertical 
inheritance (Fig. 14). In principle, the Forest of Life (FOL) 
encompasses the complete set of phylogenetic trees for all genes from all 
genomes. However, a comprehensive analysis of the entire FOL is 
computationally prohibitive (with over 1000 archaeal and bacterial 
genomes now available and the computational resources accessible 
to the authors, estimation of the phylogenetic tree for each gene 
represented in all these genomes would take weeks of computer 
time), so a representative subset of the trees needs to be selected and 
analyzed. Previously [5], we defined such a subset by selecting 
100 archaeal and bacterial genomes, which are representative of 
all major prokaryote groups, and building 6901 maximum likelihood 
(ML) trees for all genes with a sufficient number of homologs 
and sufficient level of sequence conservation in this set of genomes; 
for brevity, we refer to this set of trees as the FOL. In this set of 
almost 7000 trees, only a very small portion of the forest is 
represented by nearly universal trees (Fig. 14). Furthermore, bacterial 
and archaeal universal trees are rare as well, as reflected in Fig. 14 by 
the small peaks around 41 and 59 species, i.e., all archaea and all 
bacteria, respectively. The dominant pattern in the major part of the 
FOL is completely different: the FOL is best represented by 
numerous small trees, with about 2/3 of the trees including <20 
species (Fig. 14). 


Fig. 13 The Tree-Net Trend (TNT). The figure shows a schematic of the TNT 
calculation and the rescaling procedure. Modified from ref. 61 


Fig. 14 The Forest of Life (FOL). The distribution of the trees in the FOL by the 
number of species. Modified from ref. 5 


Fig. 15 Distribution of the gene functions among the NUTs. The functional 
classification of genes was from the COG database [62] 


4.2 The Nearly Universal Trees (NUTs) 


We define the nearly universal trees (NUTs) as trees for those 
COGs that were represented in more than 90% of the included 
prokaryotes. This definition yielded 102 NUTs. Not surprisingly, 
the great majority of the NUTs are genes encoding proteins 
involved in translation and the core aspects of transcription 
(Fig. 15). Among the NUTs, only 14 corresponded to COGs 
that consist of strict 1:1 orthologs (all of them ribosomal proteins), 
whereas the rest of the NUTs included paralogs in some organisms 
(only the most conserved paralogs were used for tree construction 
[5]). The 1:1 NUTs were similar to the rest of the NUTs in terms of 
the connectivity in tree similarity (1-BSD) networks and their 
positions in the single cluster of NUTs obtained using CMDS. 
The 102 NUTs were compared to trees produced by analysis of 
concatenations of universal proteins [49]. The results showed that 
most of the NUTs were topologically similar to a tree obtained by 
the concatenation of 31 universal orthologous genes [5]; in other 
words, the "Universal Tree of Life" constructed by Ciccarelli et al. 
[49] was statistically indistinguishable from the NUTs and showed 
properties of a consensus topology. Not surprisingly, the 1:1 
ribosomal protein NUTs were even more similar to the universal tree 
than the rest of the NUTs, in part because these proteins were used 
for the construction of the universal tree and, in part, presumably 
because of the low level of HGT among ribosomal proteins. 


4.3 The Tree of Life (TOL) as a Central Trend in the FOL 


We analyzed the matrix of all-against-all tree comparisons of the 
NUTs by embedding them into a 30-dimensional tree space using 
the CMDS procedure [69, 70]. The gap statistics analysis [71] 
reveals a lack of significant clustering among the NUTs in the tree 
space. Thus, all the NUTs seem to belong to one unstructured 
cloud of points scattered around a single centroid. This organization 
of the tree space is most compatible with individual trees 
randomly deviating from a single, dominant topology (which may 
be denoted the TOL), apparently as a result of random HGT (but 
in part possibly due to random errors in the tree-construction 
procedure). Therefore, there is an unequivocal general trend 
among the NUTs. Although the topologies of the NUTs were, 
for the most part, not identical, so that the NUTs could be separated 
by their degree of inconsistency (a proxy for the amount of 
HGT), the overall high consistency level indicated that the NUTs 
are scattered in the close vicinity of a consensus tree, with HGT 
events distributed randomly [5]. 
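For readers unfamiliar with the embedding step, the following is a minimal NumPy sketch of classical multidimensional scaling (double centering followed by an eigendecomposition). It is not the R code used in the study; the function name and the toy distance matrix are illustrative, and negative eigenvalues, which can arise when tree distances are not Euclidean, are simply clipped to zero here.

```python
import numpy as np


def classical_mds(dist, n_dims):
    """Embed points with pairwise distances `dist` into `n_dims` dimensions."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centered matrix of squared distances (Gram matrix estimate).
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]   # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Toy example: three points on a line at positions 0, 1, and 3.
toy = np.array([[0.0, 1.0, 3.0],
                [1.0, 0.0, 2.0],
                [3.0, 2.0, 0.0]])
coords = classical_mds(toy, 1)
recovered = np.abs(coords - coords.T)
print(np.allclose(recovered, toy))
```

Applied to a matrix of BSD values, each row of the returned array would be the coordinates of one tree, and a clustering method such as k-means could then be run on those coordinates.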

Thus, the NUTs present a unique and strong signal of unity 
that seems to reflect the TOL pattern of evolution. The inconsis- 
tency score (IS) among the NUTs ranged from 1.4% to 4.3%, 
whereas the mean IS value for an equivalent set (102) of randomly 
generated trees with the same number of species was approximately 
80%, indicating that the topologies of the NUTs are highly consis- 
tent and nonrandom [5]. 

To further assess the potential contribution of phylogenetic 
analysis artifacts to observed inconsistencies between the NUTs, 
we analyzed these trees with different bootstrap support thresholds 
(i.e., only splits supported by bootstrap values above the respective 
threshold value were compared). Particularly low IS levels were 
detected for splits with high bootstrap support, but the inconsis- 
tency was never eliminated completely, suggesting that HGT is a 
significant contributor to the observed inconsistency among the 
NUTs (IS ranges from 0.3% to 2.1% and 0.3% to 1.8% for splits with 
a bootstrap value higher than 70 and 90, respectively) [5]. 
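The effect of restricting a comparison to well-supported splits can be illustrated with a toy sketch. It uses a plain split distance as a simple stand-in and is not the IS calculation applied to the NUTs; the function name, split labels, and bootstrap values are all hypothetical.

```python
def split_distance_above(support1, support2, threshold=0.0):
    """Split distance using only splits with bootstrap support > threshold.

    Each argument maps a split label to its bootstrap support (0-100).
    """
    s1 = {s for s, bs in support1.items() if bs > threshold}
    s2 = {s for s, bs in support2.items() if bs > threshold}
    total = len(s1) + len(s2)
    return len(s1 ^ s2) / total if total else 0.0


tree1 = {"AB|CDEF": 95, "CD|ABEF": 60, "EF|ABCD": 80}
tree2 = {"AB|CDEF": 99, "ABC|DEF": 40, "EF|ABCD": 75}

for cutoff in (0, 70, 90):
    print(cutoff, round(split_distance_above(tree1, tree2, cutoff), 2))
```

In this example the only conflicting splits are weakly supported, so the disagreement disappears once a cutoff of 70 is applied. Residual disagreement that survives a high cutoff, as reported for the NUTs, is harder to attribute to reconstruction artifacts.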

Analysis of the supernetwork built from the 102 NUTs [5] 
showed that the incongruence among these trees is mainly 
concentrated at the deepest levels, with a much greater congruence 
at shallow phylogenetic depths. The major exception is the 
unambiguous archaeal-bacterial split that is observed despite the 
apparent substantial interdomain HGT. Evidence of probable HGT 
between archaea and bacteria was obtained for approximately 44% 
of the NUTs (13% from archaea to bacteria, 23% from bacteria to 
archaea, and 8% in both directions), with the implication that HGT 
is likely to be even more common between the major branches 
within the archaeal and bacterial domains [5]. These results are 
compatible with previous reports on the apparently random 
distribution of HGT events in the history of highly conserved genes, in 
particular those encoding proteins involved in translation [75, 76], 
and on the difficulty of resolving the phylogenetic relationships 
between the major branches of bacteria [77-79] and archaea 
[5, 80, 81]. More specifically, archaeal-bacterial HGT has been 
inferred for 83% of the genes encoding aminoacyl-tRNA synthetases 
(compared with the overall 44%), essential components of the 
translation machinery that are known for their horizontal mobility 
[42, 82]. In contrast, no HGT has been predicted for any of the 
ribosomal proteins, which belong to an elaborate molecular complex, 
the ribosome, and hence appear to be non-exchangeable 
between the two prokaryotic domains [42, 76]. In addition to the 
aminoacyl-tRNA synthetases, and in agreement with many previous 
observations ([83] and references therein), evidence of HGT 
between archaea and bacteria was seen also for the few metabolic 
enzymes that belonged to the NUTs, including undecaprenyl 
pyrophosphate synthase, glyceraldehyde-3-phosphate dehydrogenase, 
nucleoside diphosphate kinase, thymidylate kinase, and others. 


4.4 The NUTs Topologies as the Central Trend and Detection of Distinct 
Evolutionary Patterns in the FOL 


Using the BSD method, we compared the topologies of the NUTs 
to those of the rest of the trees in the FOL. Notably, 2615 trees 
(~38% of the FOL) showed a greater than 50% similarity (P-value 
<0.05) to at least one of the NUTs, with the mean similarity of the 
trees to the NUTs being approximately 50% (Fig. 16). For a set of 
102 randomized trees of the same size as the NUTs, only about 
10% of the trees in the FOL showed the same or greater similarity, 
indicating that the NUTs were strongly and nonrandomly 
connected to the rest of the FOL. 
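The "connected to at least one NUT" criterion can be sketched as follows, taking similarity as 1 - BSD as used for the tree similarity networks. The function name and the toy distances are hypothetical, and the P-value filter mentioned in the text is omitted from this sketch.

```python
def fraction_connected(bsd_to_nuts, min_similarity=0.5):
    """Fraction of trees whose best similarity (1 - BSD) to any NUT
    exceeds `min_similarity`.

    `bsd_to_nuts` holds, for each tree, its BSD to every NUT.
    """
    connected = 0
    for distances in bsd_to_nuts:
        best_similarity = max(1.0 - d for d in distances)
        if best_similarity > min_similarity:
            connected += 1
    return connected / len(bsd_to_nuts)


# Four toy trees compared against two toy NUTs (made-up BSD values).
toy = [[0.30, 0.80], [0.60, 0.55], [0.45, 0.90], [0.95, 0.70]]
print(fraction_connected(toy))

# Sanity check of the percentage quoted in the text.
print(round(2615 / 6901, 3))
```

Running the same function on distances to randomized trees gives the baseline against which the observed fraction is judged.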

We then analyzed the structure of the FOL by embedding the 
3789 COG trees into a 669-dimensional space using the CMDS 
procedure [69, 70]. A CMDS clustering of the entire set of 6901 
trees in the FOL was beyond the capacity of the R software package 
used for this analysis; however, the set of COG trees included most 
of the trees with a large number of species for which the topology 
comparison is most informative. A gap statistics analysis [69, 70] of 
k-means clustering of these trees in the tree space revealed distinct 
clusters of trees in the forest. The FOL is optimally partitioned into 
seven clusters of trees (the smallest number of clusters for which the 
gap function did not significantly increase with the increase of the 
number of clusters) (Fig. 17). Clusters 1, 4, 5, and 6 were enriched 
for bacterial-only trees, all archaeal-only trees belonged to clusters 
2 and 3, and cluster 7 consisted entirely of mixed archaeal-bacterial 
trees; notably, all the NUTs form a compact group inside cluster 6. 


Fig. 16 Topological similarity between the NUTs and the rest of the FOL. 
Percentage of trees connected to the NUTs at a different percentage of similarity. 
Modified from ref. 5 


Fig. 17 Clusters and patterns in the FOL. The seven clusters identified in the FOL 
using the CMDS method and the mean similarity values between the 102 NUTs 
and all trees from each of the seven clusters are shown (* p = 0.0014; 
** p < 0.000001). Modified from ref. 5 

The results of the CMDS clustering (Fig. 17) support the 
existence of several distinct "attractors" in the FOL. However, this 
clustering has to be interpreted with caution 
because trivial separation of the trees by size could be an important 
contribution. The approaches to the delineation of distinct 
"groves" within the forest merit further investigation. The most 
salient observation for the purpose of the present study is that all 
the NUTs occupy a compact and contiguous region of the tree 
space and, unlike the complete set of the trees, are not partitioned 
into distinct clusters by the CMDS procedure. Taken together with 
the high mean topological similarity between the NUTs and the 
rest of the FOL, these findings indicate that the NUTs represent a 
valid central trend in the FOL. 


4.5 The Tree and Net Components of Prokaryote Evolution 


Fig. 18 The Tree-Net Trend (TNT) score heatmaps. (a) The 102 NUTs. (b) The FOL
without the NUTs (6799 trees). The TNT increases from red (low score, close to
random, an indication of netlike evolution) to green (high score, close to the
supertree topology, an indication of treelike evolution). The species are ordered
according to the topology of the supertree of the 102 NUTs. In (a), the major
groups of archaea and bacteria are denoted. Modified from ref. 61

Fig. 19 The Tree-Net Trend in the FOL and in the NUTs. (a) A hypothetical
equilibrium between the tree and net trends. (b) A schematic representation of
the tree tendency in the NUTs. (c) A schematic representation of the net
tendency in the FOL

The TNT map of the NUTs was dominated by the treelike signal
(green in Fig. 18a): the mean TNT score for the NUTs was 0.63
(Fig. 19b), so the evolution of the nearly universal genes of
prokaryotes appears to be almost “two-thirds treelike” (i.e., reflects
the topology of the supertree). The rest of the FOL stood in stark
contrast to the NUTs, being dominated by the netlike evolution,
with the mean TNT value of 0.39 (Fig. 19c) (about “60% netlike”).
Remarkably, areas of treelike evolution were interspersed with areas 
of netlike evolution across different parts of the FOL (Fig. 18b). 
The major netlike areas observed among the NUTs were retained,
but additional ones became apparent, including Crenarchaeota that
showed a pronounced signal of a non-treelike relationship with
diverse bacteria as well as some Euryarchaeota (Fig. 18b). The
distribution of the tree and net evolutionary signals among different
groups of prokaryotes showed a striking split among the NUTs:
among the archaea, the tree signal was heavily dominant (mean
TNT_NUTs,Archaea = 0.80 ± 0.20), whereas among bacteria the
contributions of the tree and net signals were nearly equal (mean
TNT_NUTs,Bacteria = 0.51 ± 0.38). Among the rest of the trees in the
FOL, archaea also showed a stronger tree signal than bacteria, but
the difference was much less pronounced than it was among the
NUTs (mean TNT_FOL,Archaea = 0.47 ± 0.11 and mean
TNT_FOL,Bacteria = 0.34 ± 0.08). The conclusions on the treelike and netlike
components of evolution made here are based on the assumption 
that the supertree of the NUTs represents the treelike (vertical) 
signal. We did not perform direct tests of the robustness of these 
conclusions to the supertree topology. However, observations presented
previously [5] suggest that the results are likely to be robust
given the coherence of the NUTs topologies as well as the similarity
of the supertree topology and the topologies of the individual
NUTs to the “Tree of Life” obtained from concatenated sequences
of universally conserved ribosomal proteins [49].

5 Conclusions

The analysis of the phylogenetic FOL is a logical strategy for
studying the evolution of prokaryotes because each set of orthologous
genes presents its own evolutionary history and no single
topology may represent the entire forest. Thus, the methods introduced
in this article that compare trees without the use of a preconceived
representative topology for the entire FOL may be of
wide utility in phylogenomics.

We have shown that, although no single topology may represent
the entire FOL and several distinct evolutionary trends are
detectable, the NUTs contain a strong treelike signal. Although the
treelike signal is quantitatively weaker than the sum total of the
signals from HGT, it is the most pronounced single pattern in the
entire FOL.

Under the FOL perspective, the traditional TOL concept
(a single “true” tree topology) is invalidated and should be replaced
by a statistical definition. In other words, the TOL only makes
sense as a central trend in the phylogenetic forest.

6 Exercises

1. Calculate the split distance (SD) and boot-split distance (BSD)
of the following two trees:
(((A,B)61,C)53,D,E); (((A,C)76,B)38,D,E)

2. Calculate the inconsistency score of the tree X in the “forest of
life” Y.
X = (((A,B),C),D,E)
Y = (((A,B),C),D,E), ... D); (A,B,(C,D); (A,B,(C, ...

Acknowledgment


Abstract


In the post-genomic era, large and complex molecular datasets from genome and metagenome sequencing
projects expand the limits of what is possible for bioinformatic analyses. Network-based methods are 
increasingly used to complement phylogenetic analysis in studies in molecular evolution, including com- 
parative genomics, classification, and ecological studies. Using network methods, the vertical and horizon- 
tal relationships between all genes or genomes, whether they are from cellular chromosomes or mobile 
genetic elements, can be explored in a single expandable graph. In recent years, development of new 
methods for the construction and analysis of networks has helped to broaden the availability of these 
approaches from programmers to a diversity of users. This chapter introduces the different kinds of 
networks based on sequence similarity that are already available to tackle a wide range of biological 
questions, including sequence similarity networks, gene-sharing networks and bipartite graphs, and a 
guide for their construction and analyses. 


Key words Sequence similarity network, Evolution, Lateral gene transfer (LGT), Metagenomics, 
Gene remodeling, Ecology 


1 Introduction 


An evolutionary biologist is interested in how processes governing 
evolution have produced the diversity of genes, genomes, organ- 
isms, species, and communities that are observed today. For exam- 
ple, a biologist interested in the eukaryotes may wonder what 
symbiotic partners have contributed to their origins and evolution. 
Eukaryotic nuclear genomes are chimeric in nature, encoding many 
genes acquired from their alphaproteobacterial endosymbiont 
[1-3]. However, in recent years, it has been proposed that the 
ongoing gain of genes by both microbial [4-6] and multicellular 
eukaryotes [7, 8] via lateral gene transfer (LGT) has continued to 
contribute to eukaryotic evolution, though to a lesser extent than in
prokaryotes [9]. A biologist interested in prokaryotes may wish to
investigate lateral gene transfer to explore the numbers and kinds of 
genes transferred between bacteria, archaea, and their mobile 
genetic elements [10-14]. These transfers are important for under- 
standing the accessory genomes of prokaryotes [15-17]. Further, 
studying gene transfers in real bacterial communities from different 
environments can help to test the effect of LGT on ecology and 
evolution of communities [18]. Given the prevalence of introgression
[9-11, 19], one interesting question is whether gene transfer
has led to the formation of novel fusion genes that combine parts of 
genes originating from separate domains of life [20]. An ecologist 
may wish to analyze the distribution of genes and species in the 
environment [21]. A metagenome analyst may need to overcome
an additional challenge: exploring the nature of the large proportion
of sequences in metagenome datasets that have little or no detectable
similarity to characterized sequences and studying the “microbial
dark matter” [22].

High-throughput sequencing technologies present new oppor- 
tunities to investigate these diverse kinds of questions with molec- 
ular data; however, they also present challenges in terms of the scale 
of the analyses. Consequently, a number of network-based methods 
have recently been developed to expand the toolkit available to 
molecular biologists [23], and these have already made major con- 
tributions to our understanding of molecular evolution. Networks 
have been used to shed light on the nature of the “microbial dark 
matter” [24] and used in ecological studies to explore the geo- 
graphical distribution of organisms or genes [25, 26] or the evolu- 
tion of different lifestyles [27]. Their suitability for investigating 
introgressive events has been used to enhance our understanding of 
the chimeric origin of genes in the eukaryotic proteome [28, 29], 
the flow of genes between prokaryotes and their mobile genetic 
elements [30-35], and gene sharing across mobile elements to 
study the transfer of resistance factors [14, 36]. Networks have 
also been used to classify highly mosaic viral genomes [37, 38] 
and identify gene families [39, 40]. These approaches are highly 
complementary to traditional phylogenetic approaches, high- 
lighted by the development of hybrid approaches and phylogenetic 
and phylogenomic networks [34, 41-43]. These hybrid networks 
are beyond the scope of discussion in this chapter but are covered in 
Chapters 7 and 8. 

While the generation and analysis of networks were previously 
limited to biologists with programming experience, tools have 
recently been developed to simplify the process and broaden the 
availability of network analyses of molecular sequence data. This 
chapter introduces the different kinds of networks that are already 
available to biologists and a guide to how these networks can be 
constructed and analyzed for a large range of applications in molecular
evolution. More precisely, this chapter will focus on three kinds
of network and the types of analyses that are possible using these
networks: sequence similarity networks, gene-sharing networks,
and multipartite graphs [23].


2 Sequence Similarity Networks (SSNs) 


Sequence similarity networks are the bread and butter of network-
based molecular sequence analyses, with a huge range of applications
in molecular biology. The use of SSNs for molecular sequence
analysis first came to the fore in the late 1990s and early 2000s,
when SSNs were suggested as a way to analyze the rapid influx of
new molecular sequence data due to advances in sequencing technology
and reduced cost, as well as to predict gene functions and
protein-protein interactions [39, 44-46]. One of the earliest formal
and heuristic uses of SSNs was to define the COG groups of
homologous families and facilitate prediction of the functions of
large numbers of genes based on homology [39, 40]. The need for
efficient computation and analyses for large biological databases
still persists; however, more recently SSNs have been increasingly
appreciated as useful approaches to describe complex biological
systems, including inferring the “social networks” of biological
life forms [30], producing maps of genetic diversity [27], detecting
distant homologues [47-49], and exploring gene and genome
rearrangements [50, 51].

[Fig. 1 panels. (a) Fasta file:
>Seq1
MPAWTESTLICRKLNNQTDFI
>Seq2
MPAHFESQLICTKLNCQTEFI
>Seq3
MPAHFESQLVKTKEVCQTEQW
>Seq4
MCAHFPDQLVKTKEVCQTEQW
(b) Pairwise alignment result: Seq1 vs. Seq2, 71% identity; Seq1 vs. Seq3, 42% identity; Seq1 vs. Seq4, 28% identity. (c) Network.]

Fig. 1 Constructing a simple sequence similarity network. A set of sequences (protein or DNA) in fasta format
(a) are aligned in pairs using alignment tools (such as BLAST). These alignments (b) are scored with metrics
such as the percentage identity between two sequences (the number of identical nucleotides/amino acids
displayed above) or the E-value of the alignment. In the resulting network (c), sequences are represented as
nodes. Two sequence nodes are joined with an edge if they can be aligned above a defined threshold, with the
weight of the edge often based on percentage identity or E-value

A SSN is a graph in which each node is a sequence and edges
connect any two nodes that are similar at the sequence level above a
certain threshold (e.g., coverage, percent identity, and E-value) as
determined by their pairwise alignment (Box 1) (Fig. 1). While the
principle behind SSN construction is simple, the expression of
similarity data in this structure can enable the use of powerful
algorithms for graph analyses to study complex biological phenomena.
Construction of a SSN is also frequently the starting point in a
diversity of further graph analyses. A SSN can be constructed 
directly from fasta formatted sequence files using pipelines, such 
as EGN [52], the updated and faster performing EGN2 (forth- 
coming), or PANADA [53]. Visualization of networks can be 
performed with programs such as Cytoscape [54] or Gephi [55], 
both of which also have a range of internal tools and external plug- 
ins for network analysis. While these programs are useful for the 
visualization and analysis of relatively small networks, it can be 
difficult to load large and complex networks with a lot of edges 
(e.g., >50,000 edges). In these cases the iGraph library offers an 
extremely powerful and well-supported implementation of a broad 
range of commonly used methods for both complex graph genera- 
tion and analysis in R, Python, and C++ [56]. However, using 
iGraph requires knowledge of programming in at least one of 
these languages. An additional package for network analysis in 
Python is NetworkX [57]. It is our goal here to further generalize 
network approaches by explaining how evolutionary biologists with 
less programming knowledge could analyze their data. A list including
many of the tools and programs available for SSN generation is
available at https://omictools.com.
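
As a minimal sketch of the idea in Fig. 1, the four example sequences can be compared position by position and linked whenever their identity reaches a threshold. The ungapped comparison of equal-length sequences is only a stand-in for a real pairwise aligner such as BLAST, and the function names and the 40% threshold are illustrative assumptions rather than part of any tool mentioned above.

```python
from itertools import combinations

# Example sequences from Fig. 1 (all 21 residues long).
sequences = {
    "Seq1": "MPAWTESTLICRKLNNQTDFI",
    "Seq2": "MPAHFESQLICTKLNCQTEFI",
    "Seq3": "MPAHFESQLVKTKEVCQTEQW",
    "Seq4": "MCAHFPDQLVKTKEVCQTEQW",
}

def percent_identity(a, b):
    """Ungapped identity of two equal-length sequences (stand-in for an aligner)."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def build_ssn(seqs, min_identity):
    """Return weighted edges {(u, v): identity} for pairs at or above the threshold."""
    edges = {}
    for u, v in combinations(sorted(seqs), 2):
        identity = percent_identity(seqs[u], seqs[v])
        if identity >= min_identity:
            edges[(u, v)] = identity
    return edges

# Hypothetical threshold chosen only for illustration.
edges = build_ssn(sequences, min_identity=40.0)
for (u, v), identity in sorted(edges.items()):
    print(f"{u} -- {v}: {identity:.1f}% identity")
```

Raising or lowering `min_identity` changes which pairs are joined, which is the same effect as the thresholding step described for BLAST-based networks.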


Box 1: How to Build Your Own Sequence Similarity 
Network 

1. Dataset assembly. The first and most important step of SSN 
construction is the assembly of a dataset of sequences rele- 
vant to your biological question, usually in fasta format. This 
can be used as the initial input for wizards such as EGN or 
EGN2 [52], which can fully automate the process. The 
nature of the dataset is highly dependent on the research 
question, so here we focus on the practicalities of database 
assembly. To construct the similarity network, all sequences 
in the dataset are aligned against one another in a similarity 
search. This similarity search is often the time-limiting step 
in an analysis, and the total number of searches required is 
quadratic to the number of sequences in the dataset. For 
large datasets, it is useful to benchmark the alignment using 
a subset of the data to estimate the timescale for the align- 
ment. Large datasets can generate huge outputs, not only 
due to the number of sequences but also the length of their 
identifier. One way to reduce the output size is to replace 
each sequence name in the fasta file with a unique integer. 
The use of integers will reduce disk space use and the mem- 
ory consumption for any software used to analyze the 
sequence data. 


2. Similarity search. To generate a sequence similarity network, 
all sequences must be aligned against one another in an all- 
versus-all search, in which the dataset of sequences is 
searched against a database including the same sequences. 
For gene networks, the alignment is usually done with a fast 
pairwise aligner such as BLAST [58, 59] as implemented in 
EGN [52]. Filters are often used to remove low-complexity 
sequences from the search, as these can cause artefactual hits 
(BLAST options -seg yes, -soft_masking true). The BLAST
method of alignment will be the focus of future discussion in 
this chapter; however, alternatives are available including 
BLAT [60] (also implemented in EGN), SWORD [61], 
USEARCH [62], and DIAMOND [63]. These alternatives 
generally include an option to produce a “BLAST” style 
tabulated output, making them compatible with programs 
commonly used in network analyses. 

Within alignment tools like BLAST, it is possible to 
assign thresholds, such as the maximum E-value of the 
alignment. It is not recommended to set minimal thresh- 
olds for some parameters (such as % sequence identity) 
unless required due to memory constraints so that you 
can generate networks from a single sequence alignment 
with different thresholds for comparison (e.g., compari- 
son of a 30% similarity threshold to a 90% threshold, 
where edges will only be drawn between highly similar 
genes). 

Note: It may be intuitive to use additional CPUs to 
speed up the alignment process; however, in BLAST it can 
be more efficient to split the query file and launch multi- 
ple searches on separate cores instead of using the BLAST 
multithreading option. The pairwise alignment step is 
generally the most time-limiting part of generating a 
SSN, so benchmarking should be used to establish the 
optimal settings for the pairwise alignment and/or determine the
feasibility of a project given the size of the dataset and the 
available computational resources. 


3. Filtering similarity search results: In an all-versus-all similar- 
ity search, any given query sequence will have a self-hit in the 
corresponding database. For example, with sequences A 
and B, a self-hit is query sequence A matching to sequence 
A in the database, cases of which must be removed prior to 
network construction (Fig. 2). When query sequence A in a 
similarity search is aligned with sequence B in the database, 
often the reciprocal result is also identified (an alignment 
between query sequence B and sequence A in the database). 
These are called reciprocal hits; while the sequences involved
are identical, the alignments and scores are not. Retaining
both hits would generate two different edges between the
same two nodes in a SSN, so generally only the best results
from reciprocal hits are retained, based on a score such as the
E-value (Fig. 2). Finally, a single query sequence may be
significantly aligned multiple times in different positions of
the same sequence in the database; however, for SSN construction
only the best BLAST hit is generally retained
(Fig. 2). The selection of the best BLAST hit is again generally
based on the E-value. Removing multiple hits
against the same sequence allows the generation of an undirected
network where a single edge connects two nodes,
representing the best possible alignment between these
nodes.

[Fig. 2 panels: Sequences; Sequence alignment; Network. Rows: Self hits; Reciprocal hits; Multiple hits]

Fig. 2 Filtering sequence similarity results for network construction. In the output of an all-against-all
sequence similarity search, there are a number of features that are often filtered out prior to network
construction. Self-hits (1/ and 2/), where like sequences are paired in a sequence alignment, are not
informative to network construction and are removed (highlighted by the red box surrounding the alignments).
In cases where there are reciprocal hits (3/ and 4/) between two sequences, then only the alignment with the
lowest (most significant) E-value is retained (highlighted with a green box around the retained alignment) to ensure only one
edge representing the best possible alignment connects any two nodes in the network. The same is true for
cases where a sequence has multiple hits against another sequence, such as when it aligns to another
sequence in multiple positions (5/ and 6/)


4. Thresholding and network construction: Constructing a SSN 
from a BLAST output is conceptually simple; an edge is 
created between two sequences (nodes) that have been 
aligned in the sequence similarity search. It is common to 
apply thresholding criteria such as minimal % ID and/or 
coverage and/or maximal E-value to determine whether an 
edge is drawn between two sequences in the network 
(Fig. 1). There are different ways to calculate the % coverage 
of an alignment. This could be based on the coverage of a 
single sequence in the alignment, selecting either the query 
or the database sequence in each alignment or the longest or 
shortest sequence in each alignment. Alternatively both 
(mutual coverage) can be used, retaining an alignment
when both values are above a given threshold. Edges above
the thresholding criteria can be assigned a weight based on 
these criteria, producing a weighted sequence similarity net- 
work that retains information of the properties of the align- 
ment between two sequences (Fig. 1). It is often useful to 
construct and compare several SSNs with variable stringen- 
cies defining the edges between sequences, for example, to 
optimize gene family detection within the SSN (discussed 
below). 
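
The filtering and thresholding steps of Box 1 can be sketched as follows. The `Hit` record, its field names, the example hits, and the sequence lengths are all hypothetical; a real analysis would parse these values from the tabulated output of the similarity search. The default thresholds follow the commonly used values cited in this chapter (E-value < 1e-5, mutual coverage > 80%, identity > 30%).

```python
from collections import namedtuple

# Hypothetical record for one alignment; field names are illustrative only.
Hit = namedtuple("Hit", "query subject pident qstart qend sstart send evalue")

def best_hit_per_pair(hits):
    """Drop self-hits; keep the lowest E-value hit for each unordered pair."""
    best = {}
    for hit in hits:
        if hit.query == hit.subject:
            continue  # self-hit: not informative for the network
        pair = tuple(sorted((hit.query, hit.subject)))  # A-B and B-A share a key
        if pair not in best or hit.evalue < best[pair].evalue:
            best[pair] = hit
    return best

def coverage(start, end, length):
    """Percentage of a sequence covered by the aligned region."""
    return 100.0 * (abs(end - start) + 1) / length

def build_edges(best, lengths, max_evalue=1e-5, min_pident=30.0, min_cov=80.0):
    """Keep pairs passing E-value, identity and mutual coverage thresholds."""
    edges = {}
    for pair, hit in best.items():
        qcov = coverage(hit.qstart, hit.qend, lengths[hit.query])
        scov = coverage(hit.sstart, hit.send, lengths[hit.subject])
        if (hit.evalue < max_evalue and hit.pident > min_pident
                and min(qcov, scov) > min_cov):
            edges[pair] = hit.pident  # edge weight: percent identity
    return edges

lengths = {"A": 100, "B": 100, "C": 200}
hits = [
    Hit("A", "A", 100.0, 1, 100, 1, 100, 1e-60),  # self-hit
    Hit("A", "B", 60.0, 1, 95, 3, 97, 1e-30),     # best A-B alignment
    Hit("B", "A", 59.0, 3, 97, 1, 95, 1e-28),     # reciprocal hit
    Hit("A", "B", 35.0, 70, 100, 5, 35, 1e-3),    # second, weaker hit
    Hit("A", "C", 45.0, 1, 90, 10, 99, 1e-20),    # covers only 45% of C
]
edges = build_edges(best_hit_per_pair(hits), lengths)
print(edges)
```

Because the unfiltered hits are kept in memory, `build_edges` can be called again with different thresholds to compare networks of varying stringency without repeating the similarity search.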


As with other computational approaches, the scale of network 
analysis is limited by the available computational resources. The 
limiting factor in terms of the size of network it is possible to 
construct is predominantly governed by the pairwise alignment. 
All sequences in the dataset need to be aligned against one another 
in a pairwise manner, meaning the number of alignments is quadratic
in the size of the dataset. For example, computing an all-against-all
comparison of 1,000,000 sequences requires computation
of 10^12 alignments. BLAST [64] is the standard tool for this
step, with a relatively good speed and accuracy for sequence simi- 
larity searches; however, the use of BLAST can be a bottleneck for 
the analysis of large datasets. This is an especially important consid- 
eration given the growth in the number of gene and genome 
sequences available in public databases. Several rapid alignment 
tools such as BLAT [60], USEARCH [62], Rapsearch [65], and 
Diamond [63] have been proposed to overcome this issue. For 
example, Diamond benchmarks suggest that it is almost as accurate 
as BLAST but is at least three orders of magnitude faster. 
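
The quadratic growth can be made concrete with a short calculation. The sketch below counts comparisons and extrapolates a runtime from a benchmark on a subset, assuming (as a simplification) that runtime scales exactly with the square of the number of sequences; the function names and benchmark figures are hypothetical.

```python
def all_versus_all(n):
    """Query-versus-database comparisons, including self-hits and both directions."""
    return n * n

def unordered_pairs(n):
    """Distinct pairs of different sequences (each pair counted once)."""
    return n * (n - 1) // 2

def extrapolate_runtime(subset_size, subset_seconds, full_size):
    """Scale a benchmark quadratically to the full dataset (rough assumption)."""
    return subset_seconds * (full_size / subset_size) ** 2

n = 1_000_000
print(all_versus_all(n))    # 10^12 comparisons
print(unordered_pairs(n))   # roughly half of that

# Hypothetical benchmark: 1,000 sequences took 60 s; estimate for 10,000.
print(extrapolate_runtime(1_000, 60.0, 10_000), "seconds")
```

In practice heuristic aligners do not evaluate every pair in full, so such an extrapolation is best read as an order-of-magnitude guide rather than a prediction.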

A second point to consider from the perspective of scalability is 
the complexity and size of the graph and the complexity of the 
algorithms used in their analysis. Algorithms where the number of 
calculations is linear to the size of the graph can generally be run on 
huge graphs with sufficient computational resources, for example, 
finding connected components using the depth-first search algorithm.
Algorithms for community detection (e.g., PageRank [66],
Louvain) are also linear and particularly suited for detecting groups
of closely related sequences in huge graphs (discussed in Subheading 4).
In contrast, computing graph statistics such as the betweenness
centrality is not linear in the size of the graph, even using the
relatively efficient Brandes algorithm for calculation [67], and is
therefore more difficult for huge graphs. This has led to
the development of toolkits specifically designed for the analysis of
huge graphs (e.g., NetworKit) [68]. A recent book summarizes the
challenges of the analysis of huge networks and some of the algo- 
rithms that have been developed to face these challenges [69]. 
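
To illustrate why finding connected components scales linearly, the following minimal sketch uses an iterative depth-first search in which every node and every edge is visited a bounded number of times. The adjacency-list representation and the example graph are assumptions made for illustration; libraries such as igraph or NetworkX provide their own implementations.

```python
def connected_components(adjacency):
    """Iterative depth-first search; adjacency maps each node to its neighbours."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components

# Hypothetical network: every node must appear as a key, even if isolated.
adjacency = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B"],
    "D": ["E"],
    "E": ["D"],
    "F": [],
}
print(connected_components(adjacency))
```

An explicit stack is used instead of recursion so that very large components do not exceed the interpreter's recursion limit.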
2.2 Exploiting Sequence Similarity Networks for Identification of Gene Families

A gene family is usually defined as a group of sequences that are
similar at the sequence level, indicative of homology and potentially
of shared functions; however, there is no uniform way to define this
similarity [70, 71]. One of the early contributions of SSNs in
molecular sequence analysis was the construction of the COG
database of homologous protein sequences [39, 40]. This study
attempted to define gene families based on similarity at the
sequence level using the results of sequence similarity searches.
Within the results of an all-versus-all BLAST search, groups of at
least three proteins encoded by different genomes that were more
similar to each other than they were to other proteins found in the
same genomes were defined as a likely orthologous gene family.
Orthologous gene families are groups of genes in different genomes
that show sequence similarity, likely as a result of their shared
evolutionary history.

[Fig. 3 panels: all-against-all alignment; filtering significant hits (E-value, coverage, % of identity); building network and analysis. (A) One giant connected component at low threshold. (B) Three connected components at high threshold. (C) Three communities with weighted Louvain. Edge legend: high score, lower score.]

Fig. 3 Louvain community detection in a sequence similarity network. The network is assembled from the
results of an all-versus-all alignment, as previously described. Edges can be weighted by E-value, percentage
of identity, or bitscore. For the purpose of simplification, we consider strong or weak weights rather than
actual values. (a) A giant connected component at relaxed threshold. (b) Three connected components at a
more stringent threshold. (c) Three communities with Louvain clustering algorithm, taking into account edge
weights

The idea of using graphs to identify gene families is now a core
part of many graph-based analyses. Members of a gene family
aggregate in a sub-network in a SSN. At a given threshold, these
sub-networks are called connected components (CCs), i.e.,
clusters of nodes connected by edges either directly or indirectly
(via intermediate nodes) (Fig. 3). The size (number of nodes and
edges in a CC) and density (the proportion of potential connections
between all nodes in a CC that are actually connected by edges
in the graph) of CCs will depend on the thresholds used for
constructing the SSN as well as the relationships between sequences
in the network. For example, for a given dataset at a given mutual 
coverage threshold, a threshold of 90% sequence identity will iden- 
tify a large number of small connected components that only 
include highly similar genes, while at a threshold of 30% sequence 
identity, there will be fewer but larger connected components 
including genes with more variation in sequence similarity. Commonly
used thresholds for detecting homologous gene families are
an E-value < 1e-5, mutual coverage > 80%, and a percentage of
identity > 30% [23].
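
The effect of the identity threshold on the number, size, and density of connected components can be sketched on a toy network. The edge list below is invented for illustration, and components are found here with a union-find structure (an implementation choice of this sketch, not a method prescribed above). Density is computed as 2E / (N(N - 1)) for a component with N nodes and E edges.

```python
def components_at_threshold(nodes, weighted_edges, min_identity):
    """Return (members, edge count, density) for each component at a threshold."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving keeps trees shallow
            n = parent[n]
        return n

    kept = [(u, v) for u, v, identity in weighted_edges if identity >= min_identity]
    for u, v in kept:
        parent[find(u)] = find(v)

    members, edge_count = {}, {}
    for n in nodes:
        members.setdefault(find(n), []).append(n)
    for u, v in kept:
        root = find(u)
        edge_count[root] = edge_count.get(root, 0) + 1

    summary = []
    for root, group in members.items():
        n, e = len(group), edge_count.get(root, 0)
        density = 2.0 * e / (n * (n - 1)) if n > 1 else 0.0
        summary.append((sorted(group), e, density))
    return sorted(summary)

# Hypothetical edges: (sequence, sequence, percent identity).
nodes = ["a", "b", "c", "d", "e", "f"]
weighted_edges = [
    ("a", "b", 95), ("b", "c", 92), ("a", "c", 91),
    ("c", "d", 45), ("d", "e", 97), ("e", "f", 35),
]
for threshold in (30, 90):
    print(threshold, components_at_threshold(nodes, weighted_edges, threshold))
```

On this toy input the relaxed threshold yields a single sparse component, while the stringent threshold splits it into several small, dense ones, mirroring the behavior described above.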

CCs are often detected in a SSN using the Depth-First Search 
(DFS) algorithm; however, there are also other approaches for the
detection of gene families based on the idea of detecting “commu- 
nities” [72]. In some cases, a CC can be further separated into 
communities of sequences that share more similarity to one another 
than to other sequences in the CC and thus are more highly linked 
in the SSN (Fig. 3). Communities are commonly identified by 
using graph clustering algorithms such as Louvain [73], MCL 
[74], or OMA [75]; however, different clustering algorithms will 
result in different outputs. The Louvain weighted method is widely 
used because it is simple to implement and scales very well to large 
graphs (Figs. 3 and 4) [73]. MCL is a strong deterministic algo- 
rithm that has been implemented, for example, in tribeMCL [74] 
and orthoMCL [76]. A potential drawback of MCL is that it 
requires user specification of the “inflation index,” a parameter 
which controls cluster granularity (or “tightness”). A high inflation
index increases the tightness of clustering, producing a larger number
of clusters that are smaller on average than those that would be
obtained clustering the same dataset using a low inflation index.
Selecting an appropriate inflation index is not trivial and requires
optimization [74].

Fig. 4 Giant connected component before and after community detection. (a) A single giant connected
component from a sequence similarity network. (b) The same giant connected component after application
of a community detection algorithm. Node colors correspond to the newly assigned communities
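
The role of the inflation index can be explored with a small dense-matrix sketch of the two alternating MCL steps, expansion (matrix squaring) and inflation (element-wise power followed by column normalization). This is a simplified illustration written for this purpose, not the tribeMCL or orthoMCL implementation; the tolerance values, the way clusters are read from the converged matrix, and the example graph are assumptions.

```python
def normalize_columns(m):
    """Rescale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Simplified Markov clustering on a symmetric weighted adjacency matrix."""
    n = len(adjacency)
    # Self-loops are added before normalization, as is customary for MCL.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: flow spreads along paths of length two.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: strong flows are boosted, weak flows are damped.
        inflated = normalize_columns([[x ** inflation for x in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    clusters = set()
    for i in range(n):
        if m[i][i] > 1e-6:  # row i acts as an attractor
            clusters.add(frozenset(j for j in range(n) if m[i][j] > 1e-6))
    return sorted(sorted(c) for c in clusters)

# Hypothetical network: two triangles joined by one weak edge (nodes 2 and 3).
bridge = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.2, 0, 0],
    [0, 0, 0.2, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
for inflation in (1.4, 2.0, 4.0):
    print(inflation, mcl(bridge, inflation=inflation))
```

Running the loop over several inflation values on the same network is a simple way to inspect how cluster granularity responds before committing to one setting.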

A number of the above approaches have been used to compile 
additional databases of orthology that can act as useful reference 
datasets. OMA is a program that uses graph-based algorithms and 
exact Smith-Waterman alignments to identify orthology between 
genes [77-80]. OMA is also available through a web browser [81], including
a database of orthologues that, in 2015, included more than
2000 genomes and more than seven million proteins [75]. SILIX is
a software package [82] that aims at building families of homolo- 
gous sequences by using a transitive linkage algorithm, and 
HOGENOM [83] is a database that contains families inferred by 
SILIX for seven million proteins. 

In addition to clustering genes into families, valuable informa- 
tion can be extracted from the connected components using net- 
work metrics. Highly conserved sequences tend to form CCs where 
most of the nodes are connected to each other by edges, while 
sequences from more divergent families will tend to form more 
sparsely interconnected CCs. This information can be easily 
assessed for each component using the clustering coefficient. Con- 
served families will have a clustering coefficient close to 1, even for 
stringent thresholds. Identifying such conserved families can be 
useful to produce multiple sequence alignments (MSA) needed 
for phylogenetic reconstruction, but SSNs have also been demon- 
strated to unravel relationships between distant homologues by 
linking distantly related sequences together [24, 29, 48]. In a 
SSN, two distant sequences A and C which do not share similarity 
according to BLAST can be linked together due to sequence B 
which shows similarity to both A and C. 
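
As a hedged illustration, the clustering coefficient of a component can be computed as the average, over its nodes, of the fraction of neighbor pairs that are themselves connected. This average local definition is one common choice and is assumed here; the example components are invented.

```python
from itertools import combinations

def average_clustering(adjacency):
    """Mean local clustering coefficient; adjacency maps node -> set of neighbours."""
    total = 0.0
    for node, neighbours in adjacency.items():
        k = len(neighbours)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        linked = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
        total += linked / (k * (k - 1) / 2)
    return total / len(adjacency)

# Conserved family: every sequence is similar to every other one.
clique = {n: {m for m in "ABCD" if m != n} for n in "ABCD"}

# Distant homologues: A and C are linked only through B.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}

print(average_clustering(clique))
print(average_clustering(chain))
```

The chain reproduces the situation described above, in which A and C share no detectable similarity but fall into the same component through the intermediate sequence B.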

The idea of distant homology has been particularly illuminating 
regarding chimeric organisms such as eukaryotes which carry 
homologous genes inherited from a bacterial ancestor and from 
an archaeal ancestor [29]. A common way to analyze sequence 
similarity networks is to identify certain “paths” of interest, for 
example, the shortest possible paths between two nodes. This 
notion describes the path between two nodes in a connected component
that minimizes the sum of the edge weights. Alvarez-Ponce
et al. used this approach to explore the topology of connected
components in a SSN including the complete proteomes of 
14 eukaryotes, 104 prokaryotes (including archaea and bacteria), 
2389 viruses, and 1044 plasmids. Eight hundred and ninety-nine 
CCs contained sequences from all three domains, and of these 
208 contained eukaryotic sequences that were not directly similar 
to one another but only linked to one another via a “eukaryote- 
archaea-bacteria-eukaryote” shortest path. These are putatively
distant homologues in eukaryotes that were present in both the
archaeal host of the mitochondrial endosymbiont and in the alpha- 
proteobacterial endosymbiont, with both copies subsequently 
retained in eukaryotes and as such strong evidence for the chimeric 
origin of eukaryotes [29]. This demonstrates the utility of networks 
in the study of ancient evolutionary relationships including the 
origin of eukaryotes [28] or rooting the tree of life [84]. Simple 
path analysis for a network is possible using existing plug-ins within 
visualization tools such as Cytoscape [54] and Gephi [55]. 
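
A shortest-path query of this kind can be sketched with Dijkstra's algorithm. The sketch assumes that edge weights behave as distances (here, hypothetically, 100 minus percent identity), since minimizing a sum of raw similarity scores would favor the weakest links; the node names, weights, and domain labels are invented, and this is not the code used in the study cited above.

```python
import heapq

def shortest_path(graph, source, target):
    """Dijkstra's algorithm; graph maps node -> {neighbour: non-negative weight}."""
    dist = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    visited = set()
    while queue:
        d, node = heapq.heappop(queue)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for neighbour, weight in graph[node].items():
            candidate = d + weight
            if neighbour not in dist or candidate < dist[neighbour]:
                dist[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(queue, (candidate, neighbour))
    if target not in dist:
        return None  # the two nodes are in different components
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]

# Hypothetical component; weights are 100 - percent identity.
graph = {
    "euk1": {"arc1": 60.0, "vir1": 95.0},
    "arc1": {"euk1": 60.0, "bac1": 65.0},
    "bac1": {"arc1": 65.0, "euk2": 55.0},
    "vir1": {"euk1": 95.0, "euk2": 95.0},
    "euk2": {"bac1": 55.0, "vir1": 95.0},
}
domain = {"euk1": "eukaryote", "arc1": "archaea", "bac1": "bacteria",
          "vir1": "virus", "euk2": "eukaryote"}

path = shortest_path(graph, "euk1", "euk2")
print("-".join(domain[n] for n in path))
```

Translating the node path into a sequence of domain labels makes it straightforward to screen many components for a pattern of interest.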


When discussing identification of gene families, we have focused on 
networks where edges are drawn between protein sequences that 
show a high enough similarity across their entire length, defined by 
a high mutual coverage threshold (e.g., 80%). Sequence similarity 
can also be partial, for example, following gene remodeling or 
“tinkering” [85] producing new combinations of gene domains 
via gene fusion and fission events, or through the de novo sequence 
synthesis of gene extensions, adding to existing sequences. The 
term “Rosetta Stone sequence” was coined to define the formation 
of a new fusion protein in a species as the result of the fusion of two 
proteins that are found separate in another species, with authors 
originally predicting that these fusions could occur between pro- 
teins that physically interact in a common structural complex 
[86]. One of the earliest applications of sequence similarity searches 
to identify fusion proteins was an attempt to predict pairs of pro- 
teins that may physically interact in an organism based on whether 
they could be identified as a single “composite” fusion protein in 
another organism [44]. Beyond predicting protein-protein interac- 
tions, this kind of gene remodeling and recycling of existing gene 
parts has the potential to contribute to the expansion of functional 
diversity in genomes, creating new and unique combinations of 
domains and functions [51, 85, 87-91]. Similarity search-based 
screens have been implemented to identify composite genes and 
genome rearrangements in a range of prokaryotes [92-94], eukar- 
yotes [87, 95-97], and viruses [98]. 

Early attempts to identify composite genes were based on the 
output of sequence similarity searches, but without formalizing the 
results of search methods into a graph structure. The first attempt 
to formalize the problem of identifying “composite” genes in net- 
works was the “Neighborhood Correlation” approach, aiming to 
distinguish genuine multi-domain proteins sharing common ances- 
try (homologues) from novel multi-domain proteins that share 
domains due to insertions [99]. The later development of the
FusedTriplets and MosaicFinder tools attempted to unify existing
graph-based methods for “composite” gene detection
[50]. FusedTriplets is a graph-based implementation of the tradi- 
tional gene-centered method for composite gene identification, 
originally introduced by Enright et al. [44], with additional cross- 

Fig. 5 Composite gene identification using “minimal clique separators.” (a) A multiple sequence alignment of 
composite genes (yellow) with two components (blue and magenta). (b) The sequence similarity network 
corresponding to the multiple sequence alignment. The composite genes (yellow) are a minimal clique 
separator for the network. Their removal (shown in c) decomposes the network to the two separate 
component families 


checks on the absence of similarity between the two component 
genes contributing to a composite gene based on varying thresh- 
olds [50, 100]. MosaicFinder is a gene family-centered approach 
which will only identify highly conserved composite gene families 
that form “minimal clique separators” (Fig. 5) [50]. This graph 
topology implies that MosaicFinder may fail to detect divergent 
(e.g., ancient or fast evolving) composite gene families which will 
tend to form “quasi-cliques” without perfect separation. Compo- 
siteSearch [101] (available at http://www.evol-net.fr/index.php/ 
en/downloads) is a new program designed to overcome this limi- 
tation by identifying both conserved and divergent composite gene 
families (Box 2). 
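
The gene-centered logic described above can be illustrated with a minimal sketch: a gene is a candidate composite when two other genes align to distinct regions of it and show no detectable similarity to one another. This is not the FusedTriplets implementation; the hit format, the function name, and the default overlap tolerance are assumptions made for illustration only.

```python
from itertools import combinations

def find_composite_candidates(hits, max_overlap=20):
    """Return {candidate_composite: [(component_a, component_b), ...]}.

    hits: iterable of (query, subject, q_start, q_end) tuples, meaning that
    `subject` aligns to positions q_start..q_end of `query` (hypothetical
    format; a filtered BLAST table could be reduced to this).
    max_overlap: tolerated overlap, in residues, between the two aligned
    regions on the candidate composite (illustrative default).
    If a subject hits the same query several times, only the last hit is kept.
    """
    similar = set()   # unordered pairs of genes with any detected similarity
    regions = {}      # query -> {subject: (start, end)}
    for query, subject, q_start, q_end in hits:
        if query == subject:
            continue
        similar.add(frozenset((query, subject)))
        regions.setdefault(query, {})[subject] = (q_start, q_end)

    composites = {}
    for gene, aligned in regions.items():
        for a, b in combinations(sorted(aligned), 2):
            (a_start, a_end), (b_start, b_end) = aligned[a], aligned[b]
            overlap = min(a_end, b_end) - max(a_start, b_start) + 1
            # The two components must cover distinct parts of the candidate
            # and must not be similar to one another.
            if overlap <= max_overlap and frozenset((a, b)) not in similar:
                composites.setdefault(gene, []).append((a, b))
    return composites

example_hits = [("C", "A", 1, 100), ("C", "B", 120, 250),
                ("A", "C", 1, 100), ("B", "C", 1, 130)]
print(find_composite_candidates(example_hits))
```

In this toy example, gene C is reported with components A and B; adding a hit between A and B would remove C from the candidates, which mirrors the additional check on the absence of similarity between components.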


Box 2: How to Identify Composite Genes Using 
CompositeSearch 
1. BLAST search and filtering: An all-versus-all BLAST search is 
carried out as described in Box 1. Filters can be applied on 
the E-value and sequence similarity but should not include a 
mutual query coverage threshold. 


2. CompositeSearch: CompositeSearch takes a filtered BLAST
output and a list of genes as the initial input. Two search 
algorithms are implemented: “fastcomposites” detects a list 
of potential composite genes and “composites” additionally 
detects potential composite gene families and component 
gene families. Additional options are included to filter the 
network based on a number of standard metrics (e.g., E- 
value, sequence similarity, mutual coverage) and set the 
maximum overlap allowed between different components 
aligned on the same potential composite gene. The defini- 
tion of a maximum overlap allows adjustment for the 
tendency of BLAST to produce overhanging alignments 
[100]. The output of family detection includes a node file, an edge
file, and an information file reporting the number of nodes, the number
of edges, and the family connectivity. Two outputs are
included for composite gene detection, a “composites” file 
with detailed information on each predicted composite gene 
in fasta format and a “compositesinfo” file, summarizing the 
data. Similarly, two files provide detailed information on 
composite gene families and a summary of composite gene 
families. 


3. Filtering results: By default, CompositeSearch outputs all 
possible composite genes in “fast” mode or composite 
gene families in the full mode. These are given alongside a 
number of different metrics designed to help to filter families 
for more confident predictions, including the gene family 
size, number of composites directly predicted within the 
gene family, the number of domains, the number of compo- 
nent families, the number of singleton component families 
(families including only one sequence), the connectivity of 
the family, and a score based on the overlap between differ- 
ent components mapped to the composite gene. 
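
As a hedged illustration of step 1 of Box 2, the sketch below filters BLAST tabular output (the standard 12-column `-outfmt 6` layout) on E-value and percent identity while deliberately applying no mutual coverage threshold. The threshold values and the function name are assumptions chosen for the example, not recommendations from CompositeSearch.

```python
import csv
import io

# Standard column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def filter_blast_hits(handle, max_evalue=1e-5, min_pident=30.0):
    """Yield hits passing E-value and identity thresholds (no coverage filter).

    handle: any iterable of tab-separated lines, e.g. an open file.
    max_evalue, min_pident: illustrative thresholds (assumptions).
    """
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        if row["qseqid"] == row["sseqid"]:
            continue  # skip self-hits
        if float(row["evalue"]) <= max_evalue and float(row["pident"]) >= min_pident:
            yield row

example = io.StringIO(
    "geneA\tgeneA\t100.0\t200\t0\t0\t1\t200\t1\t200\t0.0\t400\n"
    "geneC\tgeneA\t45.0\t100\t55\t0\t1\t100\t1\t100\t1e-20\t150\n"
    "geneC\tgeneB\t25.0\t90\t67\t1\t120\t210\t5\t95\t1e-8\t60\n"
)
for hit in filter_blast_hits(example):
    print(hit["qseqid"], hit["sseqid"], hit["pident"], hit["evalue"])
```

Partial hits such as a component aligning to only one half of a composite gene survive this filter, which is why the coverage threshold is left out at this stage.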


Recent studies have explored composite gene formation as a 
source of innovation by “tinkering” [85] during major evolution- 
ary transitions. These can be especially interesting when exploring 
genome evolution following introgression, raising the possibility of 
formation of new composite genes using components with differ- 
ent evolutionary origins [20, 51, 102]. For example, the gain of a 
cyanobacterial endosymbiont at the origin of photosynthetic eukar- 
yotes was accompanied by the transfer of whole cyanobacterial 
genes to its new host genome, with gene functions related to the 
role of the plastid [103-105]. Identification of composite genes 
related to the origin of photosynthetic eukaryotes unraveled novel 
symbiogenetic composite genes, and unique fusions of genes 
encoded in the nucleus of photosynthetic eukaryotes that included 
components derived from the plastid endosymbiont. As with whole 
genes transferred to the nucleus, several of these components had 
predicted functions related to the role of the plastid, including
redox regulation and light response [51].


Ecological studies increasingly involve the assembly, analysis, and 
comparison of large metagenome datasets. In addition to identifi- 
cation of functions and organisms associated with a particular 
environment, these studies enable the investigation of important 
hypotheses in microbial ecology at the level of organism or 
function, such as the often quoted hypothesis that “everything is
everywhere, but the environment selects” from Bass Becking: the 
idea that microbial lineages are limitlessly dispersible in the envi- 
ronment, but the environmental conditions will select for certain 
lineages and control their distribution rather than any specific 
geographical separation [21].

Networks are useful for these kinds of ecological studies because 
existing graph algorithms can be used to investigate the structure of 
the network. When investigating gene (or gene-sharing) networks,
it is possible to distinguish nodes by labeling them based on their 
properties, such as categories for taxonomic or environmental ori- 
gins (Fig. 6). A simple way to represent this visually is to color nodes 
based on these properties in Cytoscape or Gephi. A formal way to 
explore the relationships between node properties is to use network 
metrics such as conductance [106], modularity [73], and assorta- 
tivity coefficient (normalized modularity) [107]. Assortativity and 
conductance are different metrics that attempt to answer the same 
type of question: do nodes labeled as belonging to a particular 
category, such as environmental origin, tend to be connected with 
other nodes labeled as belonging to the same category? More pre- 
cisely, conductance quantifies whether a given category of nodes 
shares more edges between themselves than with nodes from differ- 
ent categories. A low conductance approaching zero indicates that 
nodes of a given category are highly connected to one another, with
few connections to nodes from different categories. A higher con- 
ductance is indicative that nodes of this category tend to be more 
sparsely interconnected and share more connections with nodes 
from different categories. Assortativity is a measure of the prefer- 
ence for a category of nodes in a network to attach to other nodes 


[Fig. 6 panels: (A) single category cluster; (B) structured communities; (C) unstructured, widespread dispersal]


Fig. 6 Exploring distribution of annotations in sequence similarity networks. In this example, nodes within a 
single connected component are assigned two colors, blue and yellow, corresponding to their having a 
different categorical annotation (e.g., originating from a different environmental source). Using the example of 
environmental source, genes in cluster A would all have the same environmental source (blue), indicating an 
environment-specific cluster of genes. Genes in cluster B are found in two different environmental sources 
(blue and yellow); however, nodes of the same type are preferentially linked to each other in the network than 
to genes from different environmental sources. This would result in a positive assortativity coefficient 
approaching 1 for environment and a low conductance score, suggesting a strong environmental community 
structure. Genes in cluster C are also found in two different environmental sources; however, there is no clear 
pattern for the distribution of genes with regard to environment. This network would have an assortativity 
approaching 0 and a high conductance score 


2.4.1 Assortativity as a Tool to Study Geographical and Habitat Distributions of Microbes and Genes

2.4.2 Conductance in the Comparison of Lifestyles and Evolutionary Histories
from the same category. Normalized assortativity values range
between −1 and 1, where 0 indicates random distribution of categories
within the network, 1 indicates that nodes from the same
categories tend to be connected to one another in the network, and
−1 indicates that nodes from different categories tend to be
connected in the network. A detailed description of the algorithms 
used in these calculations can be found in [108]. 
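
A minimal sketch of the assortativity coefficient for a categorical label, following Newman's mixing-matrix formulation r = (Σ e_ii − Σ a_i²) / (1 − Σ a_i²) for an undirected network, is given below. The function name and input format are assumptions for illustration; established graph libraries provide equivalent routines.

```python
from collections import defaultdict

def categorical_assortativity(edges, category):
    """Assortativity coefficient for a categorical node label.

    edges: iterable of (u, v) pairs of an undirected network.
    category: dict mapping each node to its label.
    Returns a value in [-1, 1], or None when the coefficient is undefined
    (no edges, or all edge ends attached to a single label).
    """
    mixing = defaultdict(float)  # counts of edge ends joining label i to label j
    n_ends = 0
    for u, v in edges:
        cu, cv = category[u], category[v]
        mixing[(cu, cv)] += 1    # each undirected edge is counted in both directions
        mixing[(cv, cu)] += 1
        n_ends += 2
    if n_ends == 0:
        return None
    share = defaultdict(float)   # a_i: fraction of edge ends attached to label i
    trace = 0.0                  # sum of e_ii: fraction of ends on same-label edges
    for (ci, cj), count in mixing.items():
        share[ci] += count / n_ends
        if ci == cj:
            trace += count / n_ends
    expected = sum(a * a for a in share.values())
    if 1.0 - expected < 1e-12:
        return None
    return (trace - expected) / (1.0 - expected)

# Two triangles of different environmental origin joined by a single edge.
edges = [("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
         ("y1", "y2"), ("y2", "y3"), ("y1", "y3"), ("b1", "y1")]
origin = {n: ("blue" if n.startswith("b") else "yellow") for e in edges for n in e}
print(categorical_assortativity(edges, origin))
```

The toy network is strongly but not perfectly structured by origin, so the coefficient is positive and below 1; a network in which every edge joins nodes of different labels would give −1.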


Forster et al. used assortativity (among other network statistics, 
including the previously discussed shortest path analysis) to explore 
the geographical dispersion patterns of marine ciliates in a network 
generated from ciliate SSU-rDNA sequences [25]. Sequences were 
clustered into two different levels of gene family—CCs and Louvain 
communities (LCs) as previously described. Sequences were 
assigned categorical labels based on their geographical point of 
origin (eight locations) or habitat of origin (three habitats), and 
assortativity was calculated. If sequences, and thus species, are
broadly distributed across geographical categories, then the assortativity
of SSU-rDNA sequences labeled with these geographical categories
would be low because similar sequences would be found in
different locations. Conversely, if similar sequences tend to be
from the same geographical category, indicative of endemism, then 
assortativity of sequence geographical origin will be high (Fig. 6). 
The majority of CCs and LCs showed a positive assortativity for 
geographical origin, higher than expected by chance, indicative of 
geographical community structure as opposed to global dispersal of 
ciliates. Similar approaches were used by Fondi et al. and applied to 
a collection of environmental metagenome samples to test the 
“everything is everywhere” hypothesis at the gene pool and func- 
tional level. Gene pools were more strongly associated with a 
particular ecological niche than with specific geographical location, 
supporting the idea that microbial genes are found everywhere but 
the environment selects for them [26]. 


Conductance is used to explore the clustering of pairs of different 
node categories in a connected component. In a study by Cheng 
et al., the proteomes of 84 prokaryote genomes were categorized 
into four broad redox groups based on their lifestyle: methanogens,
obligate anaerobes, facultative anaerobes, and obligate aerobes 
[27]. For each CC in a pan-proteome sequence similarity network 
including all 84 genomes, the conductance was calculated for pairs 
of redox categories and compared to values obtained following 
random relabelling of the components. The distributions of conductance
values for the methanogen and obligate anaerobe
groups indicated that the sequences in these groups have features
distinct from those in other groups, that anaerobes and aerobes 
tend to be dissimilar, and that their sequences are more isolated 
from one another in the SSN than expected by chance. 
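
The comparison against random relabelling can be sketched as follows. The chapter does not give a formula for conductance, so this example assumes the standard graph-theoretic definition (edges leaving the category divided by the smaller of the two total degrees) and treats a single category against all others rather than pairs of categories. Function names, the number of permutations, and the seed are illustrative assumptions.

```python
import random

def conductance(edges, labels, category):
    """Conductance of the set of nodes carrying `category` (undirected network)."""
    cut = 0      # edges joining the category to any other category
    vol_in = 0   # total degree of nodes in the category
    vol_out = 0  # total degree of all other nodes
    for u, v in edges:
        in_u, in_v = labels[u] == category, labels[v] == category
        vol_in += in_u + in_v
        vol_out += (not in_u) + (not in_v)
        cut += in_u != in_v
    smaller = min(vol_in, vol_out)
    return cut / smaller if smaller else None

def relabelling_test(edges, labels, category, n_permutations=1000, seed=0):
    """Return (observed conductance, fraction of shuffles with a value <= observed)."""
    observed = conductance(edges, labels, category)
    if observed is None:
        return None, None
    rng = random.Random(seed)
    nodes = sorted(labels)
    values = [labels[n] for n in nodes]
    as_low = 0
    for _ in range(n_permutations):
        rng.shuffle(values)  # keep label frequencies, randomize their placement
        value = conductance(edges, dict(zip(nodes, values)), category)
        if value is not None and value <= observed:
            as_low += 1
    return observed, as_low / n_permutations

edges = [("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
         ("y1", "y2"), ("y2", "y3"), ("y1", "y3"), ("b1", "y1")]
labels = {n: ("anaerobe" if n.startswith("b") else "aerobe") for e in edges for n in e}
print(relabelling_test(edges, labels, "anaerobe"))
```

A small fraction in the second position indicates that the observed category is more isolated in the network than most random labellings with the same category sizes.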
2.5 SSNs in Remote Homologue Identification: Shedding Light on the Microbial Dark Matter

2.6 Exploiting SSNs to Analyze Classifications


An additional example of the use of conductance is in exploring 
the propensity of a gene family to lateral gene transfer. Within a 
network of archaeal and bacterial genes, CCs showing a low con- 
ductance for both archaeal and bacterial sequences indicate that the 
bacterial and archaeal genes within the corresponding families are 
structured in two separate and conserved groups (Fig. 6). Structur- 
ing gene families into two groups would indicate that there was 
little or no evidence for lateral gene transfer between archaea and 
bacteria within this particular gene family. This kind of gene family 
is rare, with only 86 gene families of 40,584 (0.2%) meeting this
criterion [24].


Up to 99% of microbial species are not cultivable and thus have not 
been studied in isolated culture. Analysis of high-throughput 
sequencing and metagenomics datasets has shed light on these 
uncultivable organisms, often referred to as the “microbial dark 
matter” [109], and in some cases enabled the reconstruction of 
draft genomes [110-114]. In most metagenome studies, a considerable
proportion of predicted ORFs show no detectable similarity
to any known proteins and are termed metaORFans [115]. These
can represent 25-85% of the total ORFs identified in metagenomes 
[22]. Identifying distant homologues of ORFans may help to pre- 
dict their functions and begin to unravel the microbial dark matter. 
Recent work by Lopez et al. in 2015 probed the microbial diversity 
of metagenome datasets from a range of environments including 
the human gut microbiome, identifying homologues of genes from 
86 ancient gene families that are distributed across archaea and 
bacteria. The majority of these gene families included environmen- 
tal homologues that were highly divergent from any of their 
cultured homologues, and many branched deeply within the phylogenetic
tree of life, highlighting our limited understanding of
diverse elements of the microbial world and hinting at the existence 
of yet unknown major divisions of life [24] (Fig. 7). 
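
The idea of reaching divergent environmental sequences through intermediate sequences can be sketched with a breadth-first search from the reference (database) sequences of a cluster: nodes at distance 1 are direct homologues, and nodes at distance 2 or more are linked only through intermediates. This is an illustrative sketch with hypothetical node names, not the procedure used in the cited study.

```python
from collections import deque

def distance_to_references(edges, reference_nodes):
    """Number of edges separating each reachable node from its nearest reference."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    distance = {node: 0 for node in reference_nodes}
    queue = deque(reference_nodes)
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance

edges = [("ref1", "ref2"), ("ref1", "env_gray"),
         ("env_gray", "env_pink1"), ("env_pink1", "env_pink2")]
dist = distance_to_references(edges, ["ref1", "ref2"])
remote = sorted(n for n, d in dist.items() if d >= 2)
print(remote)
```

Here `env_pink1` and `env_pink2` are connected to the reference cluster only via `env_gray`, so they would be missed by a direct similarity search against the references alone.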


Metagenomic and genomic data are providing scientists with a 
tantalizing amount of sequence data, casting the analysis of the 
extent of biodiversity as a major research theme in biology 
[116-120]. In theory, existing organismal and viral classifications 
are invaluable tools to structure and analyze this biodiversity. How- 
ever, the way taxonomical classifications are constructed raises 
questions about their naturalness and their actual application 
scope [38, 120-128], in particular regarding genetic diversity 
surveys. There are three major reasons for this. First, organismal 
and viral diversity is still largely undersampled, which means that 
existing classifications are incomplete [119, 120]. Therefore, taxo- 
nomically unassigned sequences cannot be readily used in class- 
based genetic diversity surveys, since this dark matter remains 
outside existing classes. Second, classifications are constructed 
[Fig. 7 legend: average % identity > 60%; max % identity to homologues in databases > 60%; max % identity to homologues in databases < 60%]


Fig. 7 Remote homologue detection to help characterize the microbial dark 
matter. (a) A hypothetical highly conserved cluster of genes from genomes 
present in sequence databases, where the average % of identity is high 
(>60%). (b) The same cluster after addition of divergent environmental 
sequences to the network. Environmental sequences in gray are more similar 
to those already identified from genome surveys (>60% max identity) so are 
connected directly to the conserved gene cluster in the network. More divergent 
sequences in pink have <60% maximum identity to their homologues in the 
database. Many of these are only identified as linked to the sequences from the 
conserved database via intermediate gray nodes. This is the notion of “transitive 
homology” 


using different features (e.g., for viruses, a mix of phylogenetic,
morphological, and structural criteria, such as replication proper- 
ties in cell culture, virion morphology, serology, nucleic acid 
sequence, host range, pathogenicity, epidemiology, or epizootiol- 
ogy); therefore their classes do not necessarily offer immediate 
proxies for quantifying genetic diversity per se. Third, evolutionary 
processes responsible for both genetic and organismal diversity are 
diverse, and they operate at different tempos and modes in different 
lineages [49, 123, 129-141]. As a result, genetic diversity within 
classes and between classes can be heterogeneous, meaning that 
existing classifications may lack efficiency to discriminate, predict, 
or compare taxa on genetic bases, potentially hampering diversity 
studies, a profound practical issue at a time when the analysis of
metagenomic sequences is becoming a priority in biology. 
Addressing these challenges is notably crucial for viral studies. 
Recently, the executive committee of the ICTV [142] proposed 
that network analysis methods that create similarity metrics based
on the detection of homologous genes and their genetic divergence
constitute a valuable strategy to assist classification of viruses.
Consistent with this, basic network properties and metrics (Table 1) can
(1) quantify whether genetic diversity is consistent within and between
the classes of existing classifications and (2) identify which classes are
the most homogeneous and distinctive in terms of genetic diversity.
Three criteria can be used to estimate intra-class genetic heterogeneity
(Fig. 8a-c). First, the average edge weights (measured as % of
identity, PID) between pairs of sequences from genomes of the




Table 1
Schematic properties of two extreme kinds of taxonomic classes with respect to their genetic diversity

“Ideal” classes                                            Not ideal classes
Low intra-class genetic diversity (high average PID)       High intra-class genetic diversity (low average PID)
High genetic cohesion (high average CCC)                   Low genetic cohesion (low average CCC)
Core components (high maxCore%)                            No core components (low maxCore%)
Obvious genetic distinctiveness (high conductance          Limited genetic distinctiveness (conductance
difference with random groups)                             similar to random groups)
Exclusive pangenome (high % of exclusive CC)               No exclusive pangenome (low % of exclusive CC)

The three top properties inform about genetic diversity within classes (intra-class genetic diversity). The last two
properties inform about the genetic distinctiveness (core and signature genes) of the classes. Interclass genetic
heterogeneity identifies when genetic diversity of a class is not comparable with genetic diversity of another class in the
classification. CCC, average proportion of genetic conservation between sequences from the same cluster and from the
same taxonomic class; PID, average edge weights (% identity) between two sequences from genomes of the same class

same class provide a trivial measure of intra-class genetic diversity. 
Second, the average proportion of Conserved Canonical Connections
(CCC) between sequences from the same connected component and
from the same taxonomic class can be exploited. In each
connected component of the SSN, CCC is the total number of edges
connecting sequences of a given class $i$ (intra-group edges, denoted
$E_i$) divided by the theoretical maximal number of possible edges
between sequences of that class in the connected component:
$\mathrm{CCC}(i) = 2E_i / (N_i (N_i - 1))$, where $N_i$ is the number of sequences of
class $i$ present in the connected component. CCC ranges between
0 and 1. Within a connected component, if all pairs of sequences 
from the same class are directly connected, CCC equals 1, since all 
these sequences are more conserved than a given %ID threshold. By 
contrast, low CCC are observed when sequences from genomes 
from the same class lack cohesive evolution, for example, when 
some related sequences evolved so fast that they show less than 
the minimal similarity required to be directly connected to their 
homologues in the graph. Third, the genetic consistency of a class 
can be estimated by (1) identifying which cluster of sequences is
present in the largest number of genomes of the class and then
(2) quantifying the proportion (in %) of the class members
harboring that most ubiquitous cluster (maxCore%). When
maxCore% of a class is <100%, it means that, for this dataset, there is no
gene family shared by all members of that class (i.e., no core genes).
The SSN structure can also serve to estimate the genetic distinc- 
tiveness of each class, i.e., whether sequences from a given class are 
[Fig. 8 panel labels: High assortativity; Low assortativity]


Fig. 8 Intra- and interclass heterogeneity measurements in weighted similarity networks. Sequences are
represented by nodes. Each node is colored to represent the taxonomic class to which its host belongs. Nodes 
with the same color belong to the same class. Edge weight is represented by edge size proportional to the 
weight. Subgraphs correspond to clusters of sequences. Direct neighbors have a greater similarity than the 
threshold set to allow such connections. PID, average edge weights (% identity) between two sequences from 
genomes of the same class; CCC, average proportion of genetic conservation between sequences from the 
same cluster and from the same taxonomic class; maxCore%, conductance; and %-exclusive components 
correspond to the estimates used to assess genetic consistency of classes 


more similar to one another than they are to sequences from other 
classes (Fig. 8d, e). Such sequences could be used as classificatory 
features to assign members to the class. In an SSN, this property
translates to a low ratio of interclass edges over intra-class edges and 
is measured by conductance (Fig. 8d). Likewise, the proportion of 
clusters comprised exclusively of sequences from one class, a diag- 
nostic feature of the class, provides an estimate of the class genetic 
distinctiveness. Genetically highly distinct classes have a high % of 
such exclusive clusters. Based on these network measures, interclass 
genetic heterogeneity can simply be diagnosed by contrasting esti- 
mates of genetic consistency for all the above measures for each 
class. There is interclass heterogeneity within a classification when 
the mean PID, mean CCC, maxCore%, DRC, and % of exclusive 
components differ between classes. 
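
Three of the measures above can be sketched in a few lines. CCC follows the definition given in the text, 2E_i / (N_i (N_i − 1)). The input formats and function names are assumptions for illustration, and the denominator chosen for the percentage of exclusive clusters (all clusters containing the class) is likewise an assumption, since the text does not specify it.

```python
def ccc(component_edges, class_of, target_class):
    """CCC(i) = 2*E_i / (N_i * (N_i - 1)) within one connected component.

    component_edges: (u, v) pairs, each undirected edge listed once, no self-loops.
    class_of: dict mapping each sequence to its taxonomic class.
    """
    members = {n for edge in component_edges for n in edge
               if class_of[n] == target_class}
    n_i = len(members)
    if n_i < 2:
        return None  # undefined with fewer than two sequences of the class
    e_i = sum(1 for u, v in component_edges if u in members and v in members)
    return 2 * e_i / (n_i * (n_i - 1))

def max_core_percent(cluster_genomes, class_genomes):
    """Percentage of class genomes harboring the most ubiquitous cluster.

    cluster_genomes: dict cluster id -> set of genomes with a sequence in it.
    class_genomes: non-empty set of genomes belonging to the class.
    """
    best = max((len(genomes & class_genomes)
                for genomes in cluster_genomes.values()), default=0)
    return 100.0 * best / len(class_genomes)

def percent_exclusive(cluster_classes, target_class):
    """Percentage of clusters containing the class that contain no other class."""
    containing = [c for c in cluster_classes.values() if target_class in c]
    if not containing:
        return None
    exclusive = sum(1 for c in containing if c == {target_class})
    return 100.0 * exclusive / len(containing)

edges = [("a1", "a2"), ("a2", "a3"), ("a1", "b1")]
classes = {"a1": "A", "a2": "A", "a3": "A", "b1": "B"}
print(ccc(edges, classes, "A"))
```

In the toy component, class A has three sequences joined by two of the three possible intra-class edges, so its CCC is 2/3.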

Such network analyses show that virus classifications face a 
pragmatic issue: overall genetic distinctiveness allows relatively 
safe assignments of viral sequences to existing classes; however, 
genetic diversity of viral taxa of similar ranks differs among the 
tested classifications. Therefore, virus classifications (especially 
ICTV classification at the family level) should be used carefully to 
avoid inaccurate estimates in metagenomic diversity surveys. Clas- 
ses with broader genetic diversity will tend to be more easily 
detected in the environment than classes with reduced genetic
diversity, since the former will necessarily be associated with more 
OTUs than the latter. Some alpha- and beta-diversity analyses of 
environmental data, which rely on counts and on contrasts of the 
abundance of taxonomic classes in different samples, will thus also 
be biased. A similar approach could be applied to different types of
classified lineages, i.e., to identify which groups of bacteria, archaea,
or eukaryotes with comparable taxonomical ranks are the most
genetically heterogeneous and which ranks of their classification
are the least genetically consistent.


3 Gene-Sharing Networks


Gene-sharing networks are often called “genome networks” as they 
are best suited for summarizing what genes are shared between 
different genomes, highlighting routes of gene sharing. The ability 
to explore gene sharing between all genomes in a network in a 
simple graph can have useful properties for reflecting microbial 
social life, inherently inclusive of gene sharing both as a conse- 
quence of vertical inheritance and lateral gene transfer (LGT). 
Bacteriophage and plasmid genomes are typically highly mosaic in 
nature due to a high level of horizontal gene transfer, making it 
difficult to classify their genomes [37, 143]. Lima-Mendez et al. 
proposed the use of gene-sharing networks as a new classification 
method that tackles this problem of mosaicism by classifying viruses 
based on their genome’s content [37]. Constructing gene-sharing 
networks using subsets of genes from different functional categories
can also be useful in exploring what kinds of genes
are being shared by different genomes.

In a gene-sharing network, each genome is represented by a 
node, and two nodes are connected by an edge when the two 
corresponding genomes share homologous genes or gene families 
(Fig. 9). These gene families can be identified from SSNs (as CCs
or LCs) or by alternative methods. In gene-sharing networks, edges
can be weighted by the number of genes or gene families shared 
between the genomes. In this way, gene-sharing networks enable 
the study of microbial social life, quantitatively displaying the gene 
families shared between genomes both as a result of vertical trans- 
mission and lateral gene transfer. 
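
The construction just described can be sketched as follows: each gene family contributes one unit of weight to the edge between every pair of genomes that both carry a member of the family. The input format (a mapping from family to genes and from gene to genome) and the function name are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def gene_sharing_network(families, genome_of):
    """Return {(genome_a, genome_b): number of gene families shared}.

    families: dict family id -> iterable of gene identifiers.
    genome_of: dict gene identifier -> genome identifier.
    Genome pairs are stored in sorted order so each edge appears once.
    """
    weights = defaultdict(int)
    for genes in families.values():
        genomes = sorted({genome_of[g] for g in genes})
        for a, b in combinations(genomes, 2):
            weights[(a, b)] += 1
    return dict(weights)

families = {"fam1": ["g1_x", "g2_x", "g3_x"],
            "fam2": ["g1_y", "g2_y"],
            "fam3": ["g3_z"]}
genome_of = {"g1_x": "G1", "g1_y": "G1", "g2_x": "G2",
             "g2_y": "G2", "g3_x": "G3", "g3_z": "G3"}
print(gene_sharing_network(families, genome_of))
```

Genomes G1 and G2 share two families and are therefore joined by an edge of weight 2, whereas the family found only in G3 contributes no edge.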

Gene-sharing networks are useful tools for exploring overall 
patterns of gene sharing between genomes. Recently, Lord et al. 
developed BRIDES, a software package that specifically identifies 
different kinds of patterns in evolving gene-sharing networks after 
the addition of new genome nodes [144]. However, in gene- 
sharing networks the kind of gene families that are being shared is 
often overlooked. To explore how functions are shared between 
different genomes, gene-sharing networks can be built from genes 


3.1 Classification of Entities Using Gene-Sharing Networks
Fig. 9 Translating gene networks to gene-sharing networks. (a) Gene network
for three gene families. Gene nodes are colored based on their genome of origin.
The background color corresponds to the gene family color in part c. (b) The
gene-sharing network corresponding to the gene network in a. Edges are
weighted on the number of gene families shared by the genomes. (c)
Multiplex gene-sharing network corresponding to the gene network in a.
Genomes are connected by multiple edges with colors corresponding to
different gene families. These edges are weighted based on the number of
genes shared between two genomes for each family


using different subsets of functions (Fig. 10) [29]. An alternative 
form of the gene-sharing network is the multiplex network. In this 
network nodes can be linked by edges of different types, for exam- 
ple, each edge representing a different gene family or different 
functional groups of gene families, thus retaining additional infor- 
mation compared to a simpler gene-sharing network (Fig. 9) 
[23]. Multiplex networks can be useful for small-scale analyses; 
however, with large datasets they can rapidly become difficult to 
interpret and analyze. Importantly, multiplex networks are unim- 
odal projections of bipartite graphs (discussed in the Subheading 
14) which can provide greater clarity and have a number of attrac- 
tive properties for the analysis of larger datasets. 


The possibility of summarizing gene sharing between sets of enti- 
ties with complex evolutionary histories means that gene-sharing 
networks can be useful for classifying organisms based on their gene 
content. Lima-Mendez et al. analyzed bacteriophage genomes to 
generate two different phage gene-sharing networks that reflect 
their reticulate evolutionary history [37]. In the first gene-sharing 
network, phage genomes (nodes) were connected by edges when 
Fig. 10 Functional gene-sharing network reflecting the chimeric nature of eukaryotes. These gene-sharing
networks describe how genes in different functional categories are shared between bacteria (green),
archaea (yellow), eukaryotes (gray), plasmids (purple), and viruses (red) from a published dataset [29]. In
both cases, a giant connected component is shown alongside examples of smaller connected components. (a)
Gene-sharing network for COG category D: cell division control. In this network, sequences of eukaryote origin
(gray) cluster with bacterial sequences, reflecting their origin in the alphaproteobacterial endosymbiont that
would become the mitochondrion. (b) Gene-sharing network for COG category K: transcription machinery. In
this network, eukaryote sequences (gray) cluster with archaeal sequences, reflecting the origin of these genes
in the archaeal host of the mitochondrial endosymbiont


3.2 Exploring Routes of Gene Sharing in Gene-Sharing Networks


they shared significant similarity at the sequence level. This gene- 
sharing network was clustered using the previously discussed MCL 
algorithm [145], identifying distinct groups of phages with 
sequence similarity. Following clustering, membership to a partic- 
ular cluster was reassessed based on shared similarity with viruses in 
other clusters, reflecting their reticulate evolutionary history, allow- 
ing the generation of a matrix assigning a score describing the 
relative membership of any given viral genome to a particular 
classification group. In the second approach, Lima-Mendez et al. 
generated a “module”-based gene-sharing network, where edges 
are drawn between two phage genomes if they share a “module,” in 
this case defined as a group of genes with similar phylogenetic 
profiles, enabling the exploration of what kinds of genes are shared 
between different groups of phages or are “signatures” for a partic- 
ular group of phage genomes [37]. 


Two network metrics, also useful in the analysis of gene networks, 
can be used to attempt to identify “hubs” of gene sharing in the 
context of gene-sharing networks: node “degree” and “between- 
ness.” Both metrics aim to determine the centrality of a node in a 
network. The degree of a node is simply the number of edges that it 
is connected to. The betweenness of a node is the frequency at 
which it is found in all the possible shortest paths between any two
nodes in the network. Halary et al. used gene-sharing networks 
based on DNA sequence similarity to explore gene sharing between 
prokaryotes and mobile genetic elements [30]. Plasmids were iden- 
tified as hubs of gene sharing within this pool of genomes, suggest- 
ing that they are key vectors for genetic exchange between cellular 
genomes and a potential DNA reservoir shared by genomes. Phages 
were more peripheral in the network and mostly linked prokaryotes 
from the same lineage. Thus, gene-sharing networks provided 
insights on the evolutionary processes that shape the gene content 
of prokaryote genomes. 
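
Both centrality metrics can be computed directly on an unweighted, undirected gene-sharing network. The sketch below counts, for each node, the shortest paths between other pairs of nodes that pass through it, using the accumulation scheme of Brandes' algorithm; when several shortest paths exist between a pair, each contributes fractionally. Node names are hypothetical, and graph libraries or the Cytoscape and Gephi tools mentioned earlier offer equivalent functions.

```python
from collections import deque

def degree_and_betweenness(edges):
    """Return (degree, betweenness) dicts for an undirected, unweighted network."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    degree = {n: len(adj) for n, adj in neighbours.items()}
    betweenness = dict.fromkeys(neighbours, 0.0)
    for source in neighbours:
        # Breadth-first search counting the shortest paths from `source`.
        order = []
        predecessors = {n: [] for n in neighbours}
        n_paths = dict.fromkeys(neighbours, 0)
        n_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nxt in neighbours[node]:
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
                if distance[nxt] == distance[node] + 1:
                    n_paths[nxt] += n_paths[node]
                    predecessors[nxt].append(node)
        # Back-propagate each node's share of the shortest paths.
        dependency = dict.fromkeys(neighbours, 0.0)
        for node in reversed(order):
            for pred in predecessors[node]:
                dependency[pred] += n_paths[pred] / n_paths[node] * (1 + dependency[node])
            if node != source:
                betweenness[node] += dependency[node]
    # Every unordered pair was visited from both ends, so halve the totals.
    return degree, {n: b / 2 for n, b in betweenness.items()}

edges = [("plasmid", "A"), ("plasmid", "B"), ("plasmid", "C"), ("C", "phage")]
degree, betweenness = degree_and_betweenness(edges)
print(degree["plasmid"], betweenness["plasmid"], betweenness["phage"])
```

In this toy network the plasmid node has the highest degree and betweenness, whereas the phage node is peripheral, echoing the qualitative pattern reported above.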

The importance of plasmids in genetic worlds was further high- 
lighted by exploring plasmid gene-sharing networks without inclu- 
sion of prokaryote genomes [14, 36]. Connecting 2343 plasmid 
genomes based on shared gene content in a single graph demon- 
strated that plasmids tended to cluster based on the phylogenetic 
class of their corresponding host prokaryote rather than habitat but 
that more mobile plasmids tended to be more “central” in the graph, 
indicating that these were hubs of gene sharing. Specifically, routes of 
gene sharing for gene families including antibiotic resistance markers 
were identified between actinobacterial plasmids and gammaproteo- 
bacterial plasmids, suggesting that Actinobacteria may act as a reser- 
voir for antibiotic resistance genes for Gammaproteobacteria [14]. 

The finding that plasmids are hubs of gene sharing for prokary- 
ote genomes was supported by analysis of gene sharing in a pro- 
teobacterial phylogenomic network including 329 proteobacterial 
genomes [32]. A phylogenomic network is a type of phylogenetic 
network that has been constructed from fully sequenced genomes. 
In this example the phylogenomic network is an alternative to a 
gene-sharing network, in which genome nodes within a phylogeny 
are linked by edges if they share genes [34]. This study identified 
extensive evidence for lateral gene transfer among Proteobacteria, 
with at least one LGT event inferred in 75% of all gene families. Of 
these putative LGTs, more were related to plasmid-related genes 
than phage-related genes, suggesting plasmid conjugation was a 
more frequent source of gene transfer [32]. Directed graphs explor- 
ing directionality of LGT events between 657 prokaryote genomes 
allowed the polarization of 32,028 putative LGT events finding 
that frequency of recent events correlates with genome sequence 
similarity and most LGTs occurring between donor-recipient pairs 
with <5% difference in GC content, suggesting that there are some 
barriers to lateral gene transfer between prokaryotes but that these 
are not insurmountable [31]. Later reconstruction of transduction 
events linking phage donors and recipients in a phylogenomic 
network demonstrated that LGT by transduction was generally 
highest in similar genomes and between clusters of closely related 
species but that this constraint was occasionally broken, resulting in 
LGTs over long evolutionary distances [35].


4 Bipartite Graphs


Bipartite graphs are excellent at summarizing what genes are shared 
between sets of genomes, and as such are ideal for comparative 
genomics, including for the comparison of genomes reconstructed 
in metagenomic analyses. The potential to extend this approach to 
multilevel graphs, adding additional layers of information such as 
the environment in ecological studies, could provide a powerful 
summary of gene sharing in relatively complex datasets. 

A multilevel network is a network in which edges exclusively 
connect nodes of different types, i.e., representing different levels 
of biological organization. Thus, a bipartite graph is a graph with 
two types of nodes (top and bottom nodes), where edges exclu- 
sively connect nodes of different types (Fig. 11) [146]. The types of 
nodes used can vary widely depending on the biological question, 
from linking diseases (top nodes) to their associated genes (bottom 
nodes) in order to explore the association between related disease 
phenotypes and their genetic causes [147, 148], to exploring the 
concept of flavor pairings in food based on a graph of ingredients 
(top nodes) and the flavor compounds they contain (bottom 
nodes) [149]. For applications in molecular biology, a typical exam- 
ple of a bipartite graph may describe the relationships between 
genomes (top nodes) and gene families (bottom nodes), with 
edges between nodes indicating that a genome encodes at least 
one member of the corresponding gene family (Fig. 11) [23, 33, 
38, 150]. This kind of genome to gene family graph is particularly 
suited for the comparative analysis of the gene content of genomes 
in microbial communities and for exploring patterns of gene shar- 
ing, for example, between distantly related cellular genomes [33] or 
between cellular genomes and their mobile genetic elements (Corel 
et al. forthcoming). It is possible to represent all genes shared 
between a given set of genomes, as a result of both vertical inheri- 
tance and horizontal gene transfer, in a single bipartite graph [23]. 


[Figure 11: panel (a) "Bipartite graph" with genome and gene family nodes; panel (b) "Quotient graph" with genome nodes, twin nodes, and an articulation point]

Fig. 11 A bipartite graph and its reduction to a quotient graph: (a) An example of a bipartite graph displaying 
how five gene families are shared between three genomes. (b) A reduced form of the bipartite graph in which 
gene families are combined to “twin” nodes if they share identical taxonomic distributions. A single 
“articulation point” connects all three genomes 
This feature was utilized by Iranzo et al. to explore gene sharing
among the entire dsDNA virosphere, a group of entities typified by
high rates of molecular evolution and gene transfer [38]. In this
case, modules of related viral genomes and their shared genes were
identified by optimizing Barber's bipartite modularity of the graph
[151]. A number of additional methods have been developed for
detecting module structure within a bipartite graph, including
for weighted graphs [152]. Two recently developed tools,
AcCNET [150] and MultiTwin (forthcoming), have simplified
the process of constructing and analyzing multilevel graphs without
the need for custom programming (Boxes 3 and 4).


Box 3: Generating Gene-Sharing Networks and Bipartite 
Graphs 
1. Dataset assembly: The same rules for dataset assembly as 
described in SSN generation apply to assembling the dataset 
for bipartite and gene-sharing graphs. It is especially impor- 
tant to maintain an annotation file that maps gene IDs to 
their genome of origin. 


2. Definition of gene families: Gene family identification can be 
carried out following the construction of sequence similarity 
networks, as described in Subheading 2. There are a broad 
range of alternative approaches for construction of gene 
families that are beyond the scope of discussion in this chap- 
ter; however, all of these can also be applied to the genera- 
tion of gene-sharing and bipartite graphs. 


3. Network construction: From the definition of gene families, it 
is possible to construct both gene-sharing networks and 
bipartite graphs. 

(a) In a gene-sharing network, two genomes are connected
by an edge when they encode genes belonging to the 
same gene family. Generating this kind of network can 
be automated from BLAST or fasta sequence data using 
EGN [52]. 

(b) In a bipartite graph, there are two types of node, 
genome nodes and gene family nodes. An edge is 
drawn between a genome node and a gene family 
node if that genome encodes a member of the gene 
family. The AcCNET [150] and MultiTwin (forthcoming)
tools both include pipelines for generating bipartite 
graphs from sequence data. MultiTwin can also gener- 
ate a bipartite graph from two files: a tab-delimited file 
mapping gene identifiers to their corresponding 
genome identifier and a tab-delimited file mapping 
gene identifiers to their corresponding gene family. 
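
The construction described in step 3 can be sketched from the two kinds of tab-delimited mapping mentioned above. This is a minimal illustration and not the MultiTwin or EGN implementation; the gene, genome, and family identifiers are hypothetical, and the edge weights (gene counts and shared-family counts) are one simple choice among many.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical tab-delimited mappings (gene ID -> genome, gene ID -> gene family).
gene_to_genome_lines = [
    "g1\tG1", "g2\tG1", "g3\tG2", "g4\tG2", "g5\tG3", "g6\tG1",
]
gene_to_family_lines = [
    "g1\tF1", "g2\tF2", "g3\tF1", "g4\tF3", "g5\tF3", "g6\tF1",
]


def parse_mapping(lines):
    """Parse 'key<TAB>value' lines into a dictionary."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, value = line.split("\t")
        mapping[key] = value
    return mapping


def bipartite_edges(gene_to_genome, gene_to_family):
    """Genome-to-gene-family edges, weighted by the number of genes involved."""
    edges = defaultdict(int)
    for gene, family in gene_to_family.items():
        edges[(gene_to_genome[gene], family)] += 1
    return dict(edges)


def gene_sharing_edges(edges):
    """Genome-to-genome edges, weighted by the number of shared gene families."""
    genomes_by_family = defaultdict(set)
    for genome, family in edges:
        genomes_by_family[family].add(genome)
    shared = defaultdict(int)
    for genomes in genomes_by_family.values():
        for a, b in combinations(sorted(genomes), 2):
            shared[(a, b)] += 1
    return dict(shared)


gene_to_genome = parse_mapping(gene_to_genome_lines)
gene_to_family = parse_mapping(gene_to_family_lines)
bipartite = bipartite_edges(gene_to_genome, gene_to_family)
sharing = gene_sharing_edges(bipartite)
print(bipartite)
print(sharing)
```

The gene-sharing network is obtained here by projecting the bipartite graph onto its genome nodes, which makes explicit that the bipartite graph retains information (which families are shared) that the projection discards.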
Two topological features of bipartite graphs can be used to
facilitate studies of gene sharing by an exact decomposition of the 
bipartite graph: twins and articulation points [23, 153]. A bipartite 
graph can be reduced to a quotient graph, a reduced variant of the 
bipartite graph where nodes from the bipartite graph have been 
combined based on sharing similar properties without the loss of 
information. For twin nodes (“twins”), this reduction is based on 
the combination of bottom nodes that have identical neighbors 
into a single “twin” supernode in the quotient graph (Fig. 11). This 
is a useful way of reducing the size of large graphs without losing 
information, but twin nodes also have useful properties for graph 
interpretation. The genomes supporting a twin node (its neigh- 
bors) define a club of genomes that share genes, through common 
ancestry and/or horizontal transfer, and the number of gene 
families making up the twin gives a simple description of how 
many gene families are shared between this club. For example, in 
any given dataset, any “core” set of gene families encoded by all 
species in the analysis will be represented by a single twin node. The 
gene families combined in twin supernodes can be viewed as gene 
families that are likely to be transmitted together [23]. An articula- 
tion point is a node that, when removed, will split the graph into 
two or more connected components. Within a gene family-genome
bipartite graph, articulation points are expected to help to identify 
“public genetic goods,” gene families that are shared by distantly 
related entities that may confer an advantage independent of gene- 
alogy [23, 154], as well as selfish genetic elements such as transpo- 
sases that also spread across multiple genomes. 
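
Twins and articulation points can be found with very little code on a small graph. The sketch below uses a hypothetical genome-to-gene-family graph (it is not the graph of Fig. 11) and a brute-force test for articulation points that is adequate only for small examples; dedicated tools or linear-time algorithms would be needed for large datasets.

```python
from collections import defaultdict

# Hypothetical bipartite edges (genome, gene family).
edges = [
    ("G1", "F1"), ("G2", "F1"),
    ("G1", "F2"), ("G2", "F2"),
    ("G1", "F3"), ("G2", "F3"), ("G3", "F3"),
    ("G3", "F4"),
    ("G3", "F5"),
]
families = sorted({family for _, family in edges})


def neighbors_from_edges(edges):
    """Build an undirected adjacency map from (genome, family) pairs."""
    graph = defaultdict(set)
    for genome, family in edges:
        graph[genome].add(family)
        graph[family].add(genome)
    return dict(graph)


def count_components(graph, removed=None):
    """Count connected components, optionally ignoring one removed node."""
    seen = set() if removed is None else {removed}
    components = 0
    for start in graph:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return components


def articulation_points(graph):
    """Nodes whose removal increases the number of connected components."""
    base = count_components(graph)
    return {n for n in graph if count_components(graph, removed=n) > base}


def twins(graph, families):
    """Group gene families by identical sets of supporting genomes."""
    groups = defaultdict(list)
    for family in families:
        groups[frozenset(graph[family])].append(family)
    return {support: sorted(members) for support, members in groups.items()}


graph = neighbors_from_edges(edges)
points = articulation_points(graph)
quotient = twins(graph, families)
print(sorted(points))
for support, members in quotient.items():
    print(sorted(support), members)
```

Each key of `quotient` is a club of supporting genomes and each value is the set of gene families merged into one supernode of the quotient graph. In this example the gene family F3 is an articulation point; the genome G3 is also reported because it carries private gene families, so in practice one may wish to filter the result to gene family nodes.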


Box 4: Considerations for the Construction and Analysis of
Bipartite Graphs Using AcCNET and MultiTwin

The default workflow for both AcCNET and MultiTwin takes
protein sequence data in FASTA format as input and generates a
bipartite graph alongside a number of graph summary statistics
and outputs for visualization in standard tools (such as
Gephi and Cytoscape). The two tools nevertheless differ in a
number of important ways, including:

• Graph levels: Both AcCNET and MultiTwin can generate a
bipartite graph using their default workflow; however, MultiTwin
can also be used to explore additional graph levels by
adding additional node types (e.g., a tripartite graph).
Multipartite graphs mean that gene family level annotations can be
associated with additional levels of biological information.
This may be particularly useful for the comparison of samples
in metagenomics studies or time course experiments, allowing
gene families to be associated directly with features such as
environmental origin or time point.

• Gene family identification: AcCNET uses kClust [155] to
assemble gene families, a k-mer-based method for rapid assembly
of clusters of homologous proteins from sequence data.
By default, MultiTwin identifies gene families using an
all-versus-all BLAST search, followed by identification of
connected components at a given threshold, as previously
discussed for gene family detection from SSNs. MultiTwin
can also be used in a modular way allowing for additional
customization, including the use of any custom gene family
input in the form of a “community file”: a tab-delimited file
linking every gene/protein ID to a community identifier,
with gene families defined using a clustering method of
choice.

• Edge weighting: In AcCNET the edge weight is proportional
to the inverse of the phylogenetic distance between proteins
in a cluster from a given genome to other proteins within the
same cluster. In MultiTwin, the default edge weight is based
on the number of genes present in a gene family from any
given genome.

• Graph compression: While both methods can be used to identify
“twin” nodes, only MultiTwin generates a quotient graph
from these twin nodes and identifies articulation points.

AcCNET is available at: https://sourceforge.net/projects/accnet

MultiTwin is available at: http://www.evol-net.fr/index.php/en/downloads


4.1 Using Bipartite Graphs to Explore Patterns of Gene Sharing Between Diverse Entities

The simplest application of a bipartite graph is the summary of all
genes shared between genomes in a single parsable graph. This
feature has been used to explore gene sharing in the dsDNA virome
[38], across a range of Escherichia coli genomes to investigate the
E. coli pangenome [150], and between a broad range of prokaryotes
that include newly discovered organisms [33]. In their analysis of
prokaryote genomes, Jaffe et al. used the notion of “twins” to explore
patterns of gene sharing between prokaryotes, including Archaea 
and the recently discovered ultrasmall “Candidate Phyla Radiation” 
and TM6 bacteria with extremely unusual and reduced genomes. 
The group found evidence for lateral gene transfer between ultra- 
small bacteria and other prokaryotes, consistent with the sugges- 
tion that the ultrasmall bacteria may be symbionts [33]. In their 
exploration of the dsDNA virome, Iranzo et al. used graph module 
detection, algorithms designed to identify groups of densely 
connected nodes in a graph, to identify sets of densely connected 
viral genes and genomes that included viruses with broad host 
ranges, as well as 14 hallmark viral genes that account for most of 
the gene sharing between all different viral modules [38]. 
5 Conclusions


This chapter has offered a brief introduction to the generation of 
commonly used sequence similarity networks in molecular biology 
and a guide to how they can be generated and applied to a broad 
range of studies (Fig. 12). Networks provide a highly scalable 
framework for the study of an increasingly broad range of applica- 
tions in molecular biology and evolution and have already contrib- 
uted to a number of important discoveries in the field. These 
include exploring patterns of introgression and horizontal transfer 
across all domains of life and mobile elements, the origin of eukar- 
yotes, the contribution of new genes including novel fusion genes 
to major evolutionary transitions, shedding light on the “microbial 
dark matter” in metagenome sequencing datasets and in testing 
ecological hypotheses about organism and gene distribution and 
environmental selection. New methods and tools for network analysis
are becoming increasingly user-friendly and accessible to biologists
without extensive programming experience, enabling
network analysis to become a more common part of the biologist's
toolkit for the analysis of molecular sequence data.


[Figure 12 (workflow diagram). Stages shown: Sequence dataset; Sequence alignment (BLAST, BLAT, DIAMOND, SWORD); Gene network; Gene family detection; Composite gene identification; Genome network; Multipartite graph; Analysis and visualisation (Cytoscape, Gephi). Tools named in the diagram: Pythoscape, EFI Web Server, SiLix, ComputeCC, OrthoMCL, GeneRAGE, FusedTriplets, MosaicFinder, CompositeSearch, EGN/EGN2, AcCNET, MultiTwin.]

* Many of the packages listed for generating different kinds of graph include tools for their analysis and output basic graph statistics


Fig. 12 A workflow highlighting some of the available routes for generation and analysis of SSNs, gene- 
sharing networks, and bipartite graphs. This workflow highlights just some of the many tools and routes for 
network construction and analysis 


6 Exercises


The exercises use EGN [52] and require access to a local installation
of BLAST+ [58] and Perl. The FASTA sequence file “example.faa”
provided with EGN includes a dataset of protein sequences from
Archaea, Bacteria, Eukaryota, and mobile genetic elements, available
at http://www.evol-net.fr/index.php/fr/downloads:

1. Perform a manual all-versus-all BLAST search for a given
protein sequence file from the Unix terminal (requires local
installation of BLAST). The output can be filtered to generate
a network:

(a) Make the BLAST database using “makeblastdb.”
• Command: “makeblastdb -dbtype prot -in example.faa
-out example”

(b) Perform the BLAST search using “blastp,” remembering
to output data in a tabular format for easy processing.
• Command: “blastp -query example.faa -db example
-evalue 1e-5 -seg yes -soft_masking true -max_target_seqs
5000 -outfmt "6 qseqid sseqid evalue pident bitscore
qstart qend qlen sstart send slen" -out protein.blastpout”

2. Generate an SSN using EGN from example.faa (requires local
installation of BLAST and download of EGN from
http://www.evol-net.fr/index.php/fr/downloads):

(a) Run EGN from the terminal using “perl egn1.0.plus.pl”
from the program's home directory.

(b) Follow on-screen prompts sequentially to generate an
alignment, filter the output, and generate a gene network
with outputs compatible with both Cytoscape and Gephi.

3. Visualize SSN networks:

(a) In Cytoscape: Import files named “cc.*.txt” as a network
to visualize that set of connected components.
• To associate nodes with their annotations, import
“cc.*.atr” as a table.

(b) In Gephi: Open “cc*.gexf” files to import individual
connected components from the network into Gephi.
Use the “layout” menu to explore different kinds of
layouts for the network.
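
The filtering step mentioned in Exercise 1 can be sketched in a few lines. The example below assumes the eleven tabular columns requested in the blastp command above; the identity and coverage thresholds, the sequence names, and the in-memory example hits are hypothetical choices for illustration, not values recommended by the chapter. For a real run, the same function could be given an open file handle on “protein.blastpout.”

```python
import csv
import io

# Column order requested by the -outfmt option in Exercise 1.
COLUMNS = ["qseqid", "sseqid", "evalue", "pident", "bitscore",
           "qstart", "qend", "qlen", "sstart", "send", "slen"]


def filter_hits(handle, min_pident=30.0, min_cover=0.8, max_evalue=1e-5):
    """Return {(seq_a, seq_b): best bitscore} for hits passing all thresholds."""
    edges = {}
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        query, subject = row["qseqid"], row["sseqid"]
        if query == subject:
            continue  # ignore self hits
        q_cover = (int(row["qend"]) - int(row["qstart"]) + 1) / int(row["qlen"])
        s_cover = (abs(int(row["send"]) - int(row["sstart"])) + 1) / int(row["slen"])
        if float(row["evalue"]) > max_evalue:
            continue
        if float(row["pident"]) < min_pident:
            continue
        if min(q_cover, s_cover) < min_cover:
            continue
        pair = tuple(sorted((query, subject)))  # undirected edge
        score = float(row["bitscore"])
        if score > edges.get(pair, 0.0):
            edges[pair] = score
    return edges


# Hypothetical BLAST hits standing in for the real output file.
example = io.StringIO(
    "a\ta\t0.0\t100.0\t200\t1\t100\t100\t1\t100\t100\n"
    "a\tb\t1e-50\t90.0\t180\t1\t95\t100\t3\t97\t100\n"
    "b\ta\t1e-49\t90.0\t175\t3\t97\t100\t1\t95\t100\n"
    "a\tc\t1e-6\t25.0\t40\t1\t90\t100\t1\t90\t100\n"
    "b\tc\t1e-20\t60.0\t100\t1\t40\t100\t1\t40\t50\n"
)
network = filter_hits(example)
for (seq_a, seq_b), bitscore in sorted(network.items()):
    print(seq_a, seq_b, bitscore, sep="\t")
```

Requiring a minimum coverage on both sequences is one way to avoid linking proteins that share only a single domain; relaxing it would allow partial matches such as those used to detect composite genes.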


Glossary


Articulation point: A node in a graph whose removal increases
the number of connected components of the resulting graph.

Adjacency matrix: A numerical square matrix with rows and
columns labeled by network nodes, with 1 or 0 in the matrix
indicating whether they are connected by an edge in the network.

Assortativity: A measure of the preference for labeled nodes in a
network to attach to other nodes with identical labels. This is the
Pearson correlation coefficient of the degrees of pairs of linked
nodes.

$$\text{Assortativity} = \frac{\text{modularity}}{\text{modularity}_{\max}},$$

with modularity defined below and $\text{modularity}_{\max}$ as the
modularity of a perfectly mixed network:

$$\text{modularity}_{\max} = \frac{1}{2m}\left(2m - \sum_{ij} \frac{k_i k_j}{2m}\,\delta(c_i, c_j)\right).$$

Betweenness: A centrality measure for a node in a graph. Precisely,
this is the proportion of shortest paths between all possible pairs
of nodes in a connected component that pass through this node. A
betweenness close to 1 is indicative of a highly central node,
whereas a betweenness close to 0 indicates a more peripheral node.

Bipartite graph: A graph with two types of nodes (top and bottom
nodes), in which an edge only connects nodes of different types.

Club of genomes: A group of entities that replicate separately but
exploit common genetic material that may not trace back to the last
common ancestor.

Communities (also called modules): In graph terminology, a
community is defined as a group of nodes that are more connected
between themselves than to nodes in the rest of the graph.

Composite gene: A gene that is made up of at least two component
parts.

Component genes: Genetic fragments sharing partial similarity to a
composite gene.

Conductance: A measure that quantifies whether a given category of
nodes shares more edges between themselves than with the rest of
the nodes in the graph. A low conductance approaching zero implies
that there are few edges shared between this category of nodes and
the rest of the graph, while a higher conductance implies more
connectivity between that category of nodes and other nodes outside
of the category. With $G = \{V, E\}$ a graph, $U \subseteq G$ a set of
nodes that is assumed to contain no more than half of the total
nodes, $\bar{U} = G \setminus U$, $A$ the adjacency matrix, and $d(U)$
the sum of the degrees of the vertices in $U$:

$$\text{Conductance} = \frac{\sum_{i \in U,\, j \in \bar{U}} A_{ij}}{\min\left(d(U), d(\bar{U})\right)}.$$

Connected component: A subgraph in which any pair of nodes is
connected, either directly or indirectly, and that is not connected
to the rest of the graph.

Degree: The number of edges connected to a given node.

Endosymbiont: An organism that lives inside another to the mutual
benefit of both organisms.

Edge: The link between two nodes in a network.

E-value: The number of alignments in a sequence similarity search
expected to be seen by chance searching against a database of a
certain size.

Introgression: Descent process through which the genetic material
of an entity propagates into different host structures and is
replicated within these new host structures.

Lateral gene transfer (LGT; or horizontal gene transfer, HGT):
Movement of genetic material between entities not mediated by
vertical descent.

Louvain community: A graph community identified using the Louvain
algorithm. The Louvain algorithm is based on optimizing modularity.

Network (or graph): A system of objects (nodes), some pairs of
which are linked (edges).

Multipartite graph: Similar to a bipartite graph, but with any
number of types of nodes exclusively connected to nodes of other
types.

Multiplex graph: A graph where nodes can be connected by edges of
different types.

Modularity: The fraction of edges falling within given groups
(e.g., communities or functional categories) in a network, minus
the fraction of edges that would be expected with a random
distribution of edges. With $m$ the total number of edges, $A$ the
adjacency matrix, $c_i$ the community of node $i$, $\delta()$ the
Kronecker delta, and $k_i$ the degree of node $i$:

$$\text{modularity} = \frac{1}{2m} \sum_{ij} \left[A_{ij} - \frac{k_i k_j}{2m}\right] \delta(c_i, c_j).$$

Phylogenomic network: A phylogenetic network constructed from whole
genome sequences where genomes are connected based on pairwise
relationships including vertical and lateral gene transfer (LGT)
events.
Public genetic goods
Quotient graph 
Supporting genomes 
Twins 
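
The modularity and conductance measures defined in this glossary can be checked numerically on a small graph. The sketch below follows the standard definitions for a simple undirected graph, taking m as the number of edges (an assumption of this sketch); the toy graph of two triangles joined by one edge and the node labels are hypothetical.

```python
# Two triangles (a, b, c) and (d, e, f) joined by the single edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}


def node_degrees(edges):
    """Degree of every node in an undirected edge list."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return degree


def modularity(edges, community):
    """Fraction of within-group edges minus the fraction expected at random."""
    m = len(edges)
    degree = node_degrees(edges)
    within = sum(1 for u, v in edges if community[u] == community[v])
    degree_sum = {}
    for node, k in degree.items():
        group = community[node]
        degree_sum[group] = degree_sum.get(group, 0) + k
    expected = sum((total / (2 * m)) ** 2 for total in degree_sum.values())
    return within / m - expected


def conductance(edges, subset):
    """Edges leaving the subset divided by the smaller total degree of the two sides."""
    degree = node_degrees(edges)
    cut = sum(1 for u, v in edges if (u in subset) != (v in subset))
    d_in = sum(k for node, k in degree.items() if node in subset)
    d_out = sum(k for node, k in degree.items() if node not in subset)
    return cut / min(d_in, d_out)


q = modularity(edges, community)
phi = conductance(edges, {"a", "b", "c"})
print(round(q, 4), round(phi, 4))
```

Six of the seven edges fall within the two groups, giving a clearly positive modularity, while only one edge leaves the first triangle, giving a low conductance for that set of nodes.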
Abstract


Bayesian methods for molecular clock dating of species divergences have been greatly developed during the 
past decade. Advantages of the methods include the use of relaxed-clock models to describe evolutionary 
rate variation in the branches of a phylogenetic tree and the use of flexible fossil calibration densities to 
describe the uncertainty in node ages. The advent of next-generation sequencing technologies has led to a 
flood of genome-scale datasets for organisms belonging to all domains in the tree of life. Thus, a new era has 
begun where dating the tree of life using genome-scale data is now within reach. In this protocol, we explain 
how to use the computer program MCMCTree to perform Bayesian inference of divergence times using 
genome-scale datasets. We use a ten-species primate phylogeny, with a molecular alignment of over three 
million base pairs, as an exemplar on how to carry out the analysis. We pay particular attention to how to set 
up the analysis and the priors and how to diagnose the MCMC algorithm used to obtain the posterior 
estimates of divergence times and evolutionary rates. 


Key words Molecular clock, Bayesian analysis, MCMC, Fossil, Phylogeny, Primates, Genome 


1 Introduction 


The molecular clock hypothesis, which states that the rate of molec- 
ular evolution is approximately constant with time, provides a 
powerful way to estimate the times of divergence of species in a 
phylogeny. Since its proposal over 50 years ago [1], the molecular 
clock hypothesis has been used countless times to calibrate molec- 
ular phylogenies to geological time, with the ultimate aim of dating 
the tree of life [2, 3]. Several statistical inference methodologies 
have been developed for molecular clock dating analyses; however, 
during the past decade, the Bayesian method has emerged as the 
method of choice [4, 5], and several Bayesian inference software 
packages now exist to carry out this type of analysis [6-10]. 

In this protocol, we will explain how to use the computer 
program MCMCTree to estimate times of species divergences
using genome-scale datasets within the Bayesian inference
framework. Bayesian inference is well suited for divergence time
estimation because it allows the natural integration of information 
from the fossil record (in the form of prior statistical distributions 
describing the ages of nodes in a phylogeny) with information from 
molecular sequences to estimate node ages, or geological times of 
divergence, of a species phylogeny [6, 11]. Another advantage of 
the Bayesian clock dating method is that relaxed-clock models, 
which allow for violations of the molecular clock, can be easily 
implemented as the prior on the evolutionary rates for the branches 
in the phylogeny [6]. MCMCTree allows analyses to be carried out
using two popular relaxed-clock models (the autocorrelated and
independent log-normally distributed rates models [12, 13]), as
well as under the strict molecular clock. Furthermore, MCMCTree
allows the user to build flexible fossil calibrations based on various
statistical distributions (such as the uniform, truncated-Cauchy,
skew-t, and skew-normal distributions [12, 14, 15]). But
perhaps the main advantage of MCMCTree is the implementation
of an approximate algorithm to calculate the likelihood [6, 16], 
which allows the computer analysis of genome-scale datasets to be 
completed in reasonable amounts of time. The disadvantage of the 
algorithm is that it only works on fixed tree topologies. Several 
software packages that perform co-estimation of times and tree 
topology, but which do not use the approximation, are available 
[8, 9, 17, 18]. 

In this protocol, we focus on how to carry out a clock dating 
analysis with MCMCTree, paying particular attention to diagnos- 
ing the MCMC algorithm (the workhorse algorithm within the 
Bayesian method). Theoretical details of the Bayesian clock dating 
methods implemented in the program MCMCTree are described in 
[12-16, 19]. For general introductions to Bayesian statistics and
Bayesian molecular clock dating, the reader may consult [20, 21]. 


2 Software and Data Files 


To run the protocol, you will need the MCMCTree and BASEML 
programs, which are part of the PAML software package for phylo- 
genetic analysis [22]. The source code and compiled versions of the 
code are freely available from bit.ly/ziheng-paml. All the data files 
necessary to run the protocol can be obtained from github.com/ 
mariodosreis/divtime. Please create a directory called divtime in 
your computer and download all the data files from the GitHub 
repository. This protocol was tested with PAML version 4.9e. 

You are assumed to have basic knowledge of the command line
in Unix or Windows (also known as command prompt, shell, or
terminal). Simple tutorials for users of Windows, Mac OS, and
Linux are posted at bit.ly/ziheng-software. Install MCMCTree
and BASEML on your computer system, and make sure you have
the mcmctree and baseml executables in your system's path (see
bit.ly/ziheng-paml for details on how to do this). Finally, it is
helpful (but not indispensable) to have knowledge of the R
statistical environment (www.r-project.org). R is quite useful to
analyze the output of the program, perform convergence diagnostics,
and create nice-looking plots. File R/analysis.R contains some
examples for this tutorial.

In this protocol, we will estimate the divergence times of nine
primates and one scandentian (an out-group), using a very long
alignment (over three million nucleotides long). This dataset was
chosen because it can be analyzed very quickly with MCMCTree
and it is thus suitable to illustrate the method. We also provide a
dataset of 330 species (276 primates and 4 out-groups) with a
shorter alignment, to illustrate time estimation in a taxon-rich
dataset (see Sect. 5.5 for details).


2.1 Tree and Fossil Calibrations

The phylogenetic tree of the ten species is shown in Fig. 1. The tree
encompasses members of all the main primate lineages. The ten
species were chosen because they have had their complete genomes
sequenced. They are a subset of the 36 mammal species analyzed in
[23]. File data/10s.tree contains the tree with fossil calibrations
in Newick format, which is the format required by MCMCTree.
The eight fossil calibrations are shown in Table 1. The calibrations
are the same used to estimate primate divergence times in [24]. We
discuss fossil calibrations in detail in the “Sampling from the Prior”
section. The time unit in the analysis is 100 million years (My).
Thus, the calibration B(0.075, 0.10) means the node age is
constrained to be between 7.5 and 10 million years ago (Ma).


2.2 Molecular Sequence Data

The molecular data are an alignment of 5614 protein-coding genes
from the ten species. All ambiguous codon sites were removed, and
thus the alignment contains no missing data. The alignment was
separated into two partitions: a partition consisting of all the first
and second codon positions (2,253,316 nucleotides long) and a
partition of third codon positions (1,126,658 nucleotides long).
The alignment is a subset of the larger 36-mammal-species alignment
in [23]. See also ref. 24. File 10s.phys in the data directory
contains the alignment. The alignment is compressed into site
patterns (a site pattern is a unique combination of character states
in an alignment column) to save disk space.


3 Tutorial


Fig. 1 The tree of ten species (tips, from top to bottom: Treeshrew, Bushbaby, Mouse lemur, Tarsier, Marmoset, Rhesus, Orangutan, Gorilla, Chimp, Human). Nodes with fossil calibrations are indicated with
black dots (see Table 1 for calibration densities). Internal nodes are numbered
from 11 to 19 according to the nomenclature used by MCMCTree

Table 1
List of fossil calibrations used in this tutorial

Node^a   Crown group     MCMCTree calibration^b
19       Chimp-human     B(0.075, 0.10, 0.01, 0.20)
18       Gorilla-human   B(0.10, 0.132, 0.01, 0.20)
17       Hominidae       B(0.112, 0.28, 0.01, 0.10)
16       Catarrhini      B(0.25, 0.29, 0.01, 0.10)
15       Anthropoidea    ST(0.4754, 0.0632, 0.98, 22.85)
13       Strepsirrhini   B(0.38, 0.58, 0.01, 0.10)
12       Primates        S2N(0.698, 0.65, 0.0365, -3400, 0.650, 0.138, 11409)
11       Euarchonta      G(36, 36.9)

^a Node numbers as in Fig. 1
^b B(a, b, p_L, p_U) means the calibration is a uniform distribution between a and b, with probabilities p_L and p_U that the true
node age is outside the calibration bounds. ST(location, scale, shape, df) means the calibration is a skew-t distribution.
S2N(p, location1, scale1, shape1, location2, scale2, shape2) means the calibration is a p : 1 - p mixture of two skew-normal
distributions. G(α, β) means the calibration is a gamma distribution with shape α and rate β. See MCMCTree's
manual for the full details on fossil calibration formats. The calibrations are from the primate analysis in [24]

We seek to obtain the posterior distribution (i.e., the estimates) of
the divergence times ($t$) and the molecular evolutionary rates ($r$, $\mu$,
$\sigma^2$) for the species in the phylogeny of Fig. 1. Here $t = (t_{11}, \ldots, t_{19})$
are the nine species divergence times; $r = (r_{1,12}, \ldots, r_{1,19}, r_{2,12}, \ldots, r_{2,19})$
are the 2 × 8 = 16 molecular rates, one per branch and
partition (i.e., there are eight branches in the tree and two
partitions in the molecular data); and $\mu = (\mu_1, \mu_2)$ and
$\sigma^2 = (\sigma_1^2, \sigma_2^2)$ are
the mean rates and the log-variance of the rates, for each partition.


The posterior distribution is

$$f(\mathbf{t}, \mathbf{r}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2 \mid D) \propto f(\mathbf{t})\, f(\mathbf{r} \mid \mathbf{t}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2)\, f(\boldsymbol{\mu})\, f(\boldsymbol{\sigma}^2)\, f(D \mid \mathbf{r}, \mathbf{t}),$$

where $f(\mathbf{t})$ is the prior on times;
$f(\mathbf{r} \mid \mathbf{t}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2)\, f(\boldsymbol{\mu})\, f(\boldsymbol{\sigma}^2)$ is the prior on
the branch rates, mean rates, and variances of the log-rates; and
$f(D \mid \mathbf{r}, \mathbf{t})$ is the molecular sequence likelihood. The prior on the
times is constructed by combining the birth-death process with the
fossil calibration densities (see ref. 13 for details). The prior on the
rates is constructed under a model of rate evolution, assuming, in this
tutorial, that the branch rates are independent draws from a
log-normal distribution with mean $\mu_i$ and log-variance $\sigma_i^2$ [13].
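
As a worked illustration of the independent log-normal rate prior, one way to write the density is sketched below. It assumes a parameterization in which $\mu_i$ is the mean of the rate itself and $\sigma_i^2$ is the variance of the logarithm of the rate; this is an assumption made here for illustration, and ref. 13 remains the authority for the exact form used by the program.

```latex
f(r_{i,j} \mid \mu_i, \sigma_i^2)
  = \frac{1}{r_{i,j}\sqrt{2\pi\sigma_i^2}}
    \exp\left\{ -\frac{1}{2\sigma_i^2}
      \left[ \log\frac{r_{i,j}}{\mu_i} + \frac{\sigma_i^2}{2} \right]^2 \right\},
\qquad
f(\mathbf{r} \mid \mathbf{t}, \boldsymbol{\mu}, \boldsymbol{\sigma}^2)
  = \prod_{i} \prod_{j} f(r_{i,j} \mid \mu_i, \sigma_i^2).
```

The shift by $\sigma_i^2/2$ inside the square brackets is what makes the mean of $r_{i,j}$ equal to $\mu_i$ under this parameterization. Because the rates are independent draws, the joint density does not depend on the times in this model.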
Bayesian phylogenetic inference using MCMC is computation- 
ally expensive because of the repeated calculation of the likelihood 
on a sequence alignment. The time it takes to compute the likeli- 
hood is proportional to the number of site patterns in the align- 
ment. Thus, longer alignments take longer to compute. For 
genome-scale alignments, the computation time is prohibitive. 
MCMCTree implements an approximation to the likelihood
that speeds computation time substantially, making analysis of
genome-scale data feasible. The approximate likelihood method
for clock dating was proposed by Thorne et al. [6] and extended
within MCMCTree [16]. The method relies on approximating the
log-likelihood surface on the branch lengths by its Taylor expansion.
Write $\ell(\mathbf{b}_i) = \log f(D \mid \mathbf{b}_i)$ for the log-likelihood as a function
of the branch lengths $\mathbf{b}_i = (b_{i,j} = r_{i,j} t_j)$ for the alignment partition $i$.
The Taylor approximation is

$$\ell(\mathbf{b}_i) \approx \ell(\hat{\mathbf{b}}_i) + (\mathbf{b}_i - \hat{\mathbf{b}}_i)^{\mathrm{T}} \mathbf{g}_i + \frac{1}{2} (\mathbf{b}_i - \hat{\mathbf{b}}_i)^{\mathrm{T}} \mathbf{H}_i (\mathbf{b}_i - \hat{\mathbf{b}}_i),$$

where $\hat{\mathbf{b}}_i$ are the maximum likelihood estimates (MLEs) of the
branch lengths and $\mathbf{g}_i$ and $\mathbf{H}_i$ are the gradient (vector of first
derivatives) and Hessian (matrix of second derivatives) of the
log-likelihood surface evaluated at the MLEs for the partition.
The approximation can be improved by applying transformations
to the branch lengths (see ref. 16 for details).
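
The Taylor approximation can be illustrated with a few lines of Python. This is a minimal sketch, not MCMCTree's implementation: the function name `approx_log_likelihood` and the three-branch toy values are made up for illustration, and no branch-length transformation is applied.

```python
def approx_log_likelihood(b, b_hat, g, H, lnl_hat=0.0):
    """Second-order Taylor approximation to the log-likelihood around b_hat.

    b, b_hat and g are sequences of equal length; H is a square matrix
    given as a list of rows; lnl_hat is the log-likelihood at the MLEs.
    """
    d = [x - x_hat for x, x_hat in zip(b, b_hat)]
    linear = sum(d_j * g_j for d_j, g_j in zip(d, g))
    quadratic = sum(
        d[j] * H[j][k] * d[k] for j in range(len(d)) for k in range(len(d))
    )
    return lnl_hat + linear + 0.5 * quadratic


# Toy values for a three-branch tree (hypothetical, for illustration only).
b_hat = [0.03, 0.02, 0.01]
g = [0.0, 0.0, 0.0]
H = [[-2.0e6, 0.0, 0.0], [0.0, -5.0e5, 0.0], [0.0, 0.0, -1.0e6]]

at_mle = approx_log_likelihood(b_hat, b_hat, g, H)
shifted = approx_log_likelihood([x + 0.001 for x in b_hat], b_hat, g, H)
```

Evaluating the function costs a fixed number of operations that depends only on the number of branches, which is why sampling on the approximation does not slow down as the alignment grows.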

To use the approximation, one first fixes the topology of the 
phylogeny, and then estimates the branch lengths for each align- 
ment partition on the fixed tree by maximum likelihood. The 
gradient and Hessian of the log-likelihood are obtained for each 
partition at the same time as the MLEs of the branch lengths. Note 
that parameters of the substitution model (such as the transition/transversion
ratio, κ, in the HKY model or the α parameter in the
discrete gamma model of rate variation among sites) are estimated
at this step. Thus, different substitution models will generate dif- 
ferent approximations, because they will have different MLEs for 
the branch lengths, gradient, and Hessian. Note that the time it 
takes to compute the approximate likelihood depends only on the 
number of species (which determines the size of b and H) and not 
on the alignment length, that is, once g and H have been calcu- 
lated, MCMC sampling on the approximation takes the same time 
regardless of the length of the original alignment.

3.1 Overview

We will use the approximate likelihood method to speed up the
computation of the likelihood on the large genome alignment. The
general strategy for the analysis is as follows:

1. Approximate likelihood calculation: First, we will calculate the
gradient (g) and Hessian (H) matrix of the branch lengths on
the unrooted tree. For this step, we will need to use the
MCMCTree and BASEML programs (BASEML will carry
out the actual computation of g and H). The substitution
model is chosen at this step.

2. MCMC sampling from the posterior: Once g and H have been
calculated and we have decided on our priors, we can use
MCMCTree to perform MCMC sampling from the posterior
distribution of times and rates. We will then look at the
summaries of the posterior (such as posterior mean times and rates
and 95% credibility intervals).

3. Convergence diagnostics: The MCMC algorithm is a stochastic
algorithm that visits regions of the parameter space in proportion
to the posterior distribution. Due to its very nature, it is
possible that sometimes the MCMC chain is terminated before
it has had a chance to explore the parameter space appropriately.
The way to guard against this is to run the analysis two or
more times and compare the summary statistics from the two
(or more) MCMC chains. If the results from different runs are
very similar, then convergence to the posterior distribution can
be reasonably assumed.

4. MCMC sampling from the prior: Finally, we will sample directly
from the prior of times and rates. This is particularly important
in Bayesian molecular clock dating because in most cases the
prior on times may look quite different from the fossil calibration
densities specified by the user. Thus, sampling from the
prior allows the user to check the soundness of the prior
actually used.

Note that in this protocol we assume the user has chosen a
suitable sequence alignment and a phylogenetic tree to carry out
the analysis. For genome-scale alignments, it is important that the
genes chosen among the various species are orthologous and that
the alignment has been checked for accuracy. Several chapters in
this volume can guide the user for this purpose.

3.2 Calculation of the Gradient and Hessian to Approximate the Likelihood


Go into the gH directory, and open the mcmctree-outBV.ctl file
using your favorite text editor. This control file contains the set of
parameters necessary for MCMCTree to carry out the calculations
of the gradient and Hessian needed for the approximate likelihood
method. Figure 2 shows the contents of the mcmctree-outBV.ctl
file.

seqfile = ../data/10s.phys
treefile = ../data/10s.tree

ndata = 2
seqtype = 0    * 0: nucleotides; 1:codons; 2:AAs
usedata = 3    * 0: no data (prior); 1:exact likelihood;
               * 2:approximate likelihood; 3:out.BV (in.BV)

clock =        * 1: global clock; 2: independent rates; 3: correlated rates
model = 4      * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
alpha =        * alpha for gamma rates at sites
ncatG =        * No. categories in discrete gamma
cleandata =    * remove sites with ambiguity data (1:yes, 0:no)?

Fig. 2 The gH/mcmctree-outBV.ctl file, with appropriate options to set up calculation of the gradient
and Hessian matrix for the approximate likelihood method


10
((Bushbaby: 0.029523, Mouse lemur: 0.019653): 0.006547, (Tarsier: 0.030897, (Marmoset: 0.0
0.006547 0.029523 0.019653 0.002123 0.030897 0.011754 0.015183 0.003426 0.008716

-2.114230 -2.618861 21.299836 31.765175 20.801006 -3.019251 -14.909946 8.188538 -3.70464

Hessian

[17 × 17 Hessian matrix for partition 1]

Fig. 3 The gH/out.BV file produced by BASEML. The first line has the number of species (10), the second
line has the tree topology with MLEs of branch lengths, and the MLEs of branch lengths are given again in the
third line. The fourth line contains the gradient, g, followed by the Hessian, H, for partition 1. This file will be
renamed in.BV and placed into the mcmc/ directory to carry out MCMC sampling using the approximate
likelihood method


The first two items, seqfile and treefile, indicate the 
alignment and tree files to be used. The third item, ndata, indi- 
cates the number of partitions in the sequence file, in this case, two 
partitions. The fifth item, usedata, is very important, as it tells
MCMCTree the type of analysis being carried out. The options are
0, to sample from the prior; 1, to sample from the posterior using
exact likelihood; 2, to sample from the posterior using approximate
likelihood; and 3, to prepare the data for calculation of g and H.
The last is the option we will be using in this step. The next three
items, model, alpha, and ncatG, set up the nucleotide substitution
model, in this case the HKY + Gamma model [25]. Finally, the
cleandata option tells MCMCTree whether to remove ambiguous
data. Our alignment has no ambiguous sites, so this option has
no effect in this case.
Using a terminal, go to the gH directory and type

$ mcmctree mcmctree-outBV.ctl

(Don't type in the $, as this represents the command prompt!)
This will start the MCMCTree program. MCMCTree will prepare
the data and then call BASEML to
estimate g and H. For this step to work correctly, the baseml
executable must be in your system's path. Once BASEML and
MCMCTree have finished, you will notice a file called out.BV
has been created. Figure 3 shows part of the contents of this file.
The first line indicates the number of species (10), followed by the
tree with branch lengths estimated under maximum likelihood for
the first partition (first and second codon sites). Next, we have the
MLEs of the 17 branch lengths (these are the same as in the tree but
printed in a different order). Then we have the gradient, g_1, the
vector of 17 first derivatives of the log-likelihood at the branch length
MLEs for partition 1. For small datasets, the gradient is usually
zero. For large datasets, the likelihood surface is too sharp (i.e., it
bends downward sharply and is very narrow at the MLEs), and
the gradient is not exactly zero because of numerical issues. This is
not a problem. Next, we have the 17 × 17 Hessian matrix, H_1, the matrix
of second derivatives of the log-likelihood at the branch length MLEs for
partition 1. If you scroll down the file, you will find the second block,
with the tree, branch length MLEs, g_2, and H_2 for partition
2 (third codon positions).
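
The block layout described above can be read back with a short script, which is handy for checking that each partition has a gradient of the expected length and a square Hessian. This is a minimal sketch under stated assumptions: the function name `parse_bv` is hypothetical, blank lines are assumed to be insignificant, and each Hessian row is assumed to sit on a single line. The three-species text is a made-up example, not real BASEML output.

```python
def parse_bv(text):
    """Parse out.BV-style text into one dictionary per alignment partition."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    blocks = []
    i = 0
    while i < len(lines):
        n_species = int(lines[i])                       # number of species
        tree = lines[i + 1]                             # tree with branch-length MLEs
        b_hat = [float(x) for x in lines[i + 2].split()]     # branch-length MLEs
        gradient = [float(x) for x in lines[i + 3].split()]  # first derivatives
        if lines[i + 4] != "Hessian":
            raise ValueError("expected the 'Hessian' marker line")
        k = len(b_hat)
        rows = lines[i + 5:i + 5 + k]
        hessian = [[float(x) for x in row.split()] for row in rows]
        blocks.append({"n_species": n_species, "tree": tree, "b_hat": b_hat,
                       "gradient": gradient, "hessian": hessian})
        i += 5 + k
    return blocks


# Made-up three-species block (an unrooted tree of 3 species has 3 branches).
example = """
3
(A: 0.1, B: 0.2, C: 0.3);
0.1 0.2 0.3

0.0 0.5 -0.5

Hessian

-10 1 2
1 -20 3
2 3 -30
"""
partitions = parse_bv(example)
```

For the tutorial data, one would expect two blocks, each with 17 branch lengths, 17 gradient entries, and a 17 × 17 Hessian.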


Now that we have calculated g and H, we can proceed to MCMC
sampling of the posterior distribution using the approximate
likelihood method. Copy the gH/out.BV file into the mcmc directory,
and rename it as in.BV. Now go into the mcmc directory. There
you will find mcmctree.ctl, the necessary MCMCTree control
file to carry out MCMC sampling from the posterior. Figure 4
shows the contents of the file. The first item, seed, is the seed for
the random number generator used by the MCMC algorithm.
Here it is set to -1, which tells MCMCTree to use the system's
clock time as the seed. This is useful, as running the program
multiple times will generate different outputs.


seed = -1
seqfile = ../data/10s.phys
treefile = ../data/10s.tree
mcmcfile = mcmc.txt
outfile = out.txt

ndata = 2
seqtype = 0    * 0: nucleotides; 1:codons; 2:AAs
usedata = 2    * 0: no data (prior); 1:exact likelihood;
               * 2:approximate likelihood; 3:out.BV (in.BV)
clock = 2      * 1: global clock; 2: independent rates; 3: correlated rates
RootAge =      * safe constraint on root age, used if no fossil for root.

model =        * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
alpha =        * alpha for gamma rates at sites
ncatG =        * No. categories in discrete gamma
cleandata =    * remove sites with ambiguity data (1:yes, 0:no)?

BDparas = 1 1 0          * birth, death, sampling
kappa_gamma =            * gamma prior for kappa
alpha_gamma =            * gamma prior for alpha

rgene_gamma = 2 40 1     * gammaDir prior for rate for genes
sigma2_gamma = 1 10 1    * gammaDir prior for sigma^2 (for clock=2 or 3)

print =        * 0: no mcmc sample; 1: everything except branch rates 2: everything
burnin = 20000
sampfreq = 100
nsample = 20000

Fig. 4 The mcmc/mcmctree.ctl file necessary to sample from the posterior distribution using the
approximate likelihood method


The mcmcfile option tells MCMCTree where to save the
parameters sampled (divergence times and rates) during the
MCMC iterations. Here we will save them to a file named mcmc.txt.
Once the MCMC sampling has completed, MCMCTree will
read the sample from the mcmc.txt file and generate a summary of
the MCMC output. This summary will be saved to a file called
out.txt (outfile option).

The option usedata is set to 2 here, which tells MCMCTree
to calculate the likelihood approximately by using the g and
H values saved in the in.BV file. Option clock sets the clock
model. Here we use clock = 2, which assumes rates are independent,
identically distributed realizations from a log-normal distribution
[7, 26]. Option RootAge sets the calibration on the root node of
the phylogeny, if none is present in the tree file. In our case, we
already have a calibration on the root, so this option has no effect.
The next three options, model, alpha, and ncatG, have no effect
as the substitution model was chosen during estimation of g and H.

The following options are very important as they determine the
prior used in the analysis. BDparas sets the prior on node ages for
those nodes without fossil calibrations by using the birth-death
process [12]. Here we use 1 1 0, which means node ages are
uniformly distributed between present time and the age of the root.
Options kappa_gamma and alpha_gamma set gamma priors for
the κ and α parameters in the substitution model. These have no
effect as we are using the likelihood approximation. Options
rgene_gamma and sigma2_gamma set the gamma-Dirichlet prior on
the mean substitution rate for partitions and for the rate variance
parameter, σ² [19]. The prior on the mean rate is Gamma(2, 40),
which has mean 0.05 substitutions per site per 100 My. A symmetric
Dirichlet distribution with concentration parameter equal to 1 is
used to spread the rate prior across partitions (thus rgene_gamma
= 2 40 1). See ref. 19 for details. The prior on σ² is Gamma(1, 10),
which has mean 0.1. A Dirichlet is also used to spread the prior
across partitions.

The final block of options, print, burnin, sampfreq, and
nsample, controls the length and sampling frequency of the
MCMC. We will discard the first 20,000 iterations as the burn-in
and then print parameter values to the mcmc.txt file every
100 iterations, to a maximum of 20,000 + 1 samples. Thus, our
MCMC chain will run for a total of 20,000 + 20,000 × 100 =
2,020,000 iterations.

3.3.2 Running and Summarizing the MCMC
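
Before launching a run, it can be worth checking what the control-file settings imply. The sketch below recomputes the prior means of the two gamma priors and the total chain length from the values used in this tutorial; the function names are hypothetical helpers, not part of MCMCTree.

```python
def gamma_mean(shape, rate):
    """Mean of a gamma distribution with the given shape and rate."""
    return shape / rate


def total_iterations(burnin, sampfreq, nsample):
    """Burn-in iterations plus sampfreq iterations per retained sample."""
    return burnin + sampfreq * nsample


mean_rate = gamma_mean(2, 40)      # rgene_gamma = 2 40 1  -> 0.05
mean_sigma2 = gamma_mean(1, 10)    # sigma2_gamma = 1 10 1 -> 0.1
chain_length = total_iterations(burnin=20000, sampfreq=100, nsample=20000)
# chain_length is 2020000, matching the total given in the text
```

Doubling either sampfreq or nsample roughly doubles the running time, so this calculation gives a quick way to budget longer runs.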


Go into the mcmc directory and type

$ mcmctree mcmctree.ctl

This will start the MCMC sampling. First, MCMCTree will
iterate the chain for a set number of iterations, known as the
burn-in. During this period, the program will fine-tune the step sizes for
proposing parameters in the chain. Once the burn-in is finished,
sampling from the posterior will start. Figure 5 shows a screenshot
of MCMCTree in action. The leftmost column indicates the
progress of the sampling as a percentage of the total (5%, 10% of total
iterations, and so on). The next numbers represent the acceptance
proportions, which are close to 30% (this is the result of fine-tuning
by the program). After the five acceptance proportions, the
program prints a few parameters to the screen and, in the last columns,
the log-likelihood and the time taken.

The above analysis takes about 2 min and 30 s to complete on a
2.2 GHz Intel Core i7 Processor. Once the analysis has finished,
you will see that MCMCTree has created several new files in the
mcmc directory. Rename mcmc.txt to mcmc1.txt and out.txt
to out1.txt. Now, on the command line, type again

$ mcmctree mcmctree.ctl

This will run the analysis a second time. The results should be
slightly different from the previous run due to the stochastic nature of
the algorithm. Once the second run has finished, rename mcmc.txt
to mcmc2.txt and out.txt to out2.txt. If you want to
conduct two runs simultaneously, you can create two directories
(say r1/ and r2/) and copy the necessary files into them. Then
open two terminal windows to start the runs from within each
directory.

[Screenshot: MCMCTree terminal output showing the progress percentage,
acceptance proportions (Pjump), and the current and new fine-tuned step sizes]

Fig. 5 Screenshot of MCMCTree's output during MCMC sampling of the posterior. Different runs of the
program will give slightly different output values

Using your favorite text editor, open file out1.txt, which
contains the summary of the first MCMC run. Scroll to the end
of the file (see screenshot, Fig. 6). You will see the time used by the
program (in my case 2:32), the posterior means of the parameters
sampled, and three phylogenetic trees in Newick format. The first
tree simply has internal nodes labelled with a number. This is useful
to compare the tree with the posterior means of times at the end of
the file. The second tree is the tree with branch lengths in absolute
time units. The third tree is like the second but also includes the 95%
credibility intervals (CIs) of the node ages. At the bottom of the
file, you have a table with all the divergence times (from t_n11 to
t_n19), the mean substitution rates for the two partitions (mu1
and mu2), the rate variation coefficients (sigma2_1 and
sigma2_2), and finally the log-likelihood (lnL). The table gives
the posterior means, equal-tail CIs, and highest-posterior-density CIs.
For example, the posterior age of the root (node 11, Fig. 1) is
116.8 Ma (95% CI, 92.4-144.2 Ma) while for the divergence
between human and chimp (node 19, Fig. 1) it is 8.52 Ma (95% CI,
7.58-9.81 Ma).

ln Lmax (unconstrained) = -4636133.236961
Time used: 2:26

mean of parameters using all iterations

Fig. 6 The end of the mcmc/out.txt file produced by MCMCTree at the end of the MCMC sampling of the
posterior

You will also notice that MCMCTree created a file called
FigTree.tre. This contains the posterior tree in Nexus format,
suitable for plotting in the program FigTree (tree.bio.ed.ac.uk/
software/figtree/). Figure 7 shows the posterior tree plotted in
FigTree, with the time unit set to 1 My.
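
The posterior mean and equal-tail CI reported in the summary table can be recomputed from the sampled values of any parameter. The following is a minimal sketch assuming the samples are already available as a list of numbers; the helper names are hypothetical, the quantiles use simple linear interpolation (which may differ slightly from the rule MCMCTree applies), and the example values are made up rather than taken from a real run.

```python
def quantile(sorted_values, p):
    """Quantile by linear interpolation between order statistics."""
    pos = p * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1.0 - frac) + sorted_values[upper] * frac


def summarize(samples, level=0.95):
    """Return the mean and the equal-tail credibility interval of a sample."""
    xs = sorted(samples)
    tail = (1.0 - level) / 2.0
    mean = sum(xs) / len(xs)
    return mean, quantile(xs, tail), quantile(xs, 1.0 - tail)


# Hypothetical sample of node ages (in units of 100 My), not real output.
ages = [1.10, 1.25, 1.05, 1.30, 1.17, 1.21, 1.12, 1.19]
mean_age, ci_low, ci_high = summarize(ages)
```

With the time unit of 100 My used in this tutorial, multiplying the results by 100 gives ages in millions of years.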


3.4 Convergence Diagnostics of the MCMC

Diagnosing convergence of the MCMC chains is extremely important.
Several software tools have been written for this purpose. For
example, the user-friendly Tracer program (beast.bio.ed.ac.uk/
tracer) can be used to read in the mcmc1.txt and mcmc2.txt
files and calculate several convergence statistics. Here we will
use R to perform basic convergence tests (check out file
R/analysis.R).

The first step to assess convergence is to compare the posterior 
means among the different runs. You can visually inspect the pos- 
terior means reported in the out1.txt and out2.txt files 
(Fig. 8), although this may be cumbersome. Figure 8a shows a 
plot, made with R, of posterior times for run 1 vs. those from run 
2. You can see that the points fall almost perfectly on the y = x line,
indicating that both runs have converged to the same distribution
(hopefully the posterior!).

Fig. 7 The dated primate phylogeny with error bars (representing 95% CIs of node ages), drawn with FigTree.
The time unit is 1 My
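
The comparison of posterior means between two runs can also be done numerically. The sketch below is an illustration under assumptions: the sample files are taken to be whitespace-delimited text with a header row, the generation column is assumed to be named Gen, and the two short strings stand in for the contents of mcmc1.txt and mcmc2.txt. The function names are hypothetical.

```python
def column_means(text):
    """Mean of every column in whitespace-delimited text with a header row."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, body = rows[0], rows[1:]
    return {
        name: sum(float(r[j]) for r in body) / len(body)
        for j, name in enumerate(header)
    }


def max_relative_difference(means1, means2, skip=("Gen",)):
    """Largest relative difference between matching column means."""
    diffs = []
    for name, m1 in means1.items():
        if name in skip:
            continue
        m2 = means2[name]
        scale = max(abs(m1), abs(m2)) or 1.0  # avoid dividing by zero
        diffs.append(abs(m1 - m2) / scale)
    return max(diffs)


# Made-up stand-ins for the contents of two sample files.
run1 = "Gen\tt_n11\tt_n19\n0\t1.16\t0.085\n100\t1.18\t0.087\n"
run2 = "Gen\tt_n11\tt_n19\n0\t1.15\t0.086\n100\t1.21\t0.086\n"
worst = max_relative_difference(column_means(run1), column_means(run2))
```

For real output, the two strings would be replaced by the text read from the two sample files. A small value of `worst` corresponds to points lying close to the y = x line.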


Another useful statistic to be calculated is the effective sample 
size (ESS). This gives the user an idea about whether an MCMC 
chain has been run long enough. Tracer calculates ESS automati- 
cally for all parameters. Function coda::effectiveSize in R 
will do the same. Figure 9 shows the posterior mean, ESS, posterior 
variance, and standard error of posterior means calculated with R 
for run 1 of the MCMC. The larger the ESS, the better. As a rule of
thumb, one should seek ESS larger than 1000, although this may
not always be practical in phylogenetic analysis. Note in Fig. 9 that
some estimates have very low ESSs, while others have substantially
higher ESSs. For example, t_n11 has ESS = 76.1, while t_n19 has
ESS = 1261. Running the analysis again and increasing the total
number of iterations (e.g., by increasing sampfreq or nsample)
will lead to higher ESS values for all parameters.
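
To make the idea of ESS concrete, the sketch below estimates it from the autocorrelations of a chain and computes the standard error of a posterior mean as the square root of the posterior variance divided by the ESS. It is a simple illustrative estimator (the sum of autocorrelations is truncated at the first non-positive value) and is not the spectral method used by coda::effectiveSize, so the numbers will not match those tools exactly. The simulated chains are artificial and the function names are hypothetical.

```python
import math
import random


def autocorrelation(xs, lag):
    """Sample autocorrelation of the sequence xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[t] - mean) * (xs[t + lag] - mean) for t in range(n - lag)) / n
    return cov / var


def effective_sample_size(xs):
    """ESS = n / (1 + 2 * sum of autocorrelations), with the sum
    truncated at the first non-positive autocorrelation."""
    n = len(xs)
    rho_sum = 0.0
    for lag in range(1, n):
        rho = autocorrelation(xs, lag)
        if rho <= 0.0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


def standard_error(variance, ess):
    """Standard error of the posterior mean."""
    return math.sqrt(variance / ess)


# Artificial chains: independent draws and a strongly autocorrelated series.
random.seed(1)
independent = [random.gauss(0.0, 1.0) for _ in range(2000)]
correlated, x = [], 0.0
for noise in independent:
    x = 0.9 * x + noise
    correlated.append(x)

ess_independent = effective_sample_size(independent)
ess_correlated = effective_sample_size(correlated)
```

The autocorrelated chain has a much smaller ESS than the independent one even though both contain 2000 values, which is the situation seen for t_n11 relative to t_n19.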

Let v be the posterior variance of a parameter. The standard
error of the posterior mean of the parameter is
$\mathrm{S.E.} = \sqrt{v/\mathrm{ESS}}$.
This is why having large ESS is important: Large ESS leads to small
S.E. and better estimates of the posterior mean. For example, for
t_n11, the posterior mean is 116.8 Ma, with standard error
1.53 My (Fig. 9). That is, we have estimated the mean accurately to
within 2 × 1.53 My = 3.06 My. To reduce the S.E. by half, you
need to increase the ESS four times. Note that independent
MCMC runs can be combined into a single run. Thus, you may
save time by running several MCMC chains in parallel for
computationally expensive analyses, although care must be taken to ensure
each chain has run long enough to exit the burn-in phase and
explore the posterior appropriately.

Fig. 8 Convergence diagnostic plots of the MCMC drawn with R (see R/analysis.R)

Fig. 9 Calculations of posterior mean, ESS, posterior variance, and standard error of the posterior mean in R
(see R/analysis.R)

Trace plots and histograms are useful to spot problems and
check convergence. Figure 8b, c shows trace plots for t_n19 and
t_n11, respectively. The trace of t_n19, which has high ESS, looks
like a "hairy caterpillar." Compare it to the trace of t_n11, which
has low ESS. Visual inspection of a trace plot usually gives a sense of
whether the parameter has an adequate ESS without calculating
it. Note that both traces are trendless, that is, the traces oscillate
around a mean value (the posterior mean). If you see a persistent
trend in the trace (such as an increase or a decrease), that most likely
means the MCMC did not converge to the posterior and needs a
longer burn-in period.

Figure 8d shows the smoothed histograms (calculated using
density in R) for t_n11 for the two runs. Notice that the two
histograms are slightly different. As the ESS becomes larger,
histograms for different runs will converge in shape until becoming
indistinguishable. If you see large discrepancies between
histograms, that may indicate serious problems with the MCMC, such
as lack of convergence due to short burn-in or the MCMC getting
stuck in different modes of a multimodal posterior.

3.5 MCMC Sampling from the Prior


Note that fossil calibrations (such as those of Table 1) are
represented as statistical distributions of node ages. MCMCTree uses
these distributions to construct the prior on times. However, the
resulting time prior used by the program may be substantially 
different from the original fossil calibrations, because the program 
applies a truncation so that daughter nodes are younger than their 
ancestors [14, 27]. Thus, it is advisable to calculate the time prior 
explicitly by running the MCMC with no data so that it can be 
examined and compared with the fossil calibrations and the 
posterior. 
Go to the prior directory and type

$ mcmctree mcmctree-pr.ctl

This will start the MCMC sampling from the prior. File
mcmctree-pr.ctl is identical to mcmc/mcmctree.ctl except
that option usedata has been set to 0. Sampling from the prior is
much quicker because the likelihood does not need to be
calculated. It takes about 1 min on the Intel Core i7 for MCMCTree to
complete the analysis. Rename files mcmc.txt and out.txt to
mcmc1.txt and out1.txt, and run the analysis again. Rename
the new files as appropriate. Check for convergence by calculating
the ESS and plotting the traces and histograms.

Fig. 10 Prior (gray) and posterior (black) density plots of node ages plotted with R (see R/analysis.R)


Figure 10 shows the prior densities of node ages obtained by 
MCMC sampling (shown in gray) vs. the posterior densities 
(shown in black). Notice that for four nodes t_n19, t_n18, 
t_n17, and t_n16, the posterior times “agree” with the prior, 
that is, the posterior density is contained within the prior density. 
For nodes t_n15, t_n13, and t_n11, there is some conflict 
between the prior and posterior densities. However, for nodes 
t_n14 and t_n12, there is substantial conflict between the prior 
and the posterior. In both cases the molecular data (together with 
the clock model) suggest the node age is much older than that 
implied by the calibrations. This highlights the problems in con- 
struction of fossil calibrations. 

Each fossil calibration represents the paleontologist’s best guess 
about the age of a node. For example, the calibration for the 
human-chimp ancestor is B(0.075, 0.10, 0.01, 0.20); thus, the 
calibration is a uniform distribution between 7.5 and 10 million 
years ago (Ma). The bounds of the calibration are soft, that is, there
is a set probability that the bound is violated. In this case the
probabilities are 1% for the minimum bound and 20% for the 
maximum bound. The bound probabilities are asymmetrical 
because they reflect the nature of the fossil information. Minimum 
bounds are usually set with confidence because they are based on 
the age of the oldest fossil member of a clade. For example, the 
minimum of 7.5 Ma is based on the age of †Sahelanthropus
tchadensis, recognized as the oldest fossil within the human lineage
[28]. On the other hand, establishing maximum bounds is difficult, 
as absence of fossils for certain clades cannot be interpreted as 
evidence that the clade in question did not exist during a particular 
geological time [29]. Our maximum here of 10 Ma represents the 
paleontologist’s informed guess about the likely oldest age of the 
clade; however, a large probability of 20% is given to allow for the 
fact that the node age could be older. The conflict between the 
prior and posterior seen in Fig. 10 evidences this. 
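
The effect of soft bounds can be illustrated by writing down a density that is flat between the bounds and places the stated probabilities in the tails. The sketch below assumes a power-law lower tail and an exponential upper tail joined continuously to the uniform part; this tail shape is an assumption chosen for illustration, and the MCMCTree manual should be consulted for the form the program actually uses. The function names are hypothetical.

```python
import math


def soft_uniform_density(t, t_min, t_max, p_lower, p_upper):
    """Density of a uniform calibration with soft bounds (assumed tail shapes)."""
    core = (1.0 - p_lower - p_upper) / (t_max - t_min)  # height between bounds
    if t <= 0.0:
        return 0.0
    if t < t_min:
        theta = core * t_min / p_lower       # chosen so the tail mass is p_lower
        return core * (t / t_min) ** (theta - 1.0)
    if t <= t_max:
        return core
    theta = core / p_upper                   # chosen so the tail mass is p_upper
    return core * math.exp(-theta * (t - t_max))


def integrate(f, a, b, steps=20000):
    """Trapezoidal rule on [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, steps))
    return total * h


def density(t):
    # Chimp-human calibration B(0.075, 0.10, 0.01, 0.20) from Table 1.
    return soft_uniform_density(t, 0.075, 0.10, 0.01, 0.20)


mass_below = integrate(density, 0.0, 0.075)   # probability of violating the minimum
mass_above = integrate(density, 0.10, 0.40)   # probability of violating the maximum
```

Under these assumptions, about 1% of the probability lies below 7.5 Ma and about 20% lies above 10 Ma, so the prior allows the node to be older than the maximum bound far more readily than younger than the minimum.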

Note that when constructing the time prior, the Bayesian dat- 
ing software must respect the constraints whereby daughter nodes 
must be younger than their parents. This means that calibration 
densities are truncated to accommodate the constraint, with the 
result that the actual prior used on node ages can be substantially 
different to the calibration density used (see Sect. 5.4). Detailed 
analyses of the interactions between fossil calibrations and the time 
prior and the effect of truncation are given in [14, 27]. 


4 General Recommendations for Bayesian Clock Dating

Extensive reviews of best practice in Bayesian clock dating are given
elsewhere [4, 20, 21, 30, 31]. Here we give a few brief
recommendations.

4.1 Taxon Sampling, Data Partitioning, and Estimation of Tree Topology


In this tutorial we used a small phylogeny to illustrate Bayesian time 
estimation using approximate likelihood calculation. In practical 
data analysis, it may be desirable to analyze much larger phylogenies 
(see Sect. 5.5). In large phylogenies, there may be uncertainties in 
the relationships of some groups. The approximate method dis- 
cussed here can only be applied to a fixed (known) tree topology. If 
the uncertainties in the tree are few so that just a handful of tree 
topologies appear reasonable, the approximate method can be used 
by analyzing each topology separately [23, 32]. This involves esti- 
mating g and H for each topology and then running separate 
MCMC chains on each topology to estimate the times. Several 
methods to co-estimate divergence times and tree topology are 
available [8, 9, 17, 18], although they do not implement the 
approximate likelihood method and are thus unsuitable for the 
analysis of genome-scale datasets.

We note that partitioning of sites in genomic datasets may have
important effects on divergence time estimation. The infinite-sites
theory [13, 33] studies the asymptotic behavior of the posterior
distribution of times when the amount of molecular data (measured
by the number of partitions and the number of sites per partition)
increases in a relaxed-clock dating analysis. This theory shows that
increasing the number of sites per partition will have minimal
effects on time estimation when the sequences per partition are
moderately long (>1000 sites, say), but the precision improves
when the number of partitions increases, eventually approximating
a limit when the number of partitions is infinite. The theory also
predicts that very different time estimates may be obtained if the
same genomic sequence alignment is analyzed as one partition or as
multiple partitions [34]. Furthermore, while more partitions tend
to produce more precise time estimates, with narrow CIs, they may
not necessarily be more reliable, depending on the correctness of
the fossil calibrations and the appropriateness of the partitioning
strategies. Unfortunately it is hard to decide on a good partitioning
strategy given the genome-scale sequence data, despite efforts to
design automatic partitioning strategies for phylogenetic analysis
and divergence time estimation [34-36]. Commonly used
approaches partition sites in the alignment by codon position or
by protein-coding genes of different relative rates [23]. We
recommend the use of the infinite-sites plot [14], in which uncertainty in
divergence time estimates (measured as the CI width) is plotted
against the posterior mean of times. If the scatter points fall on a
straight line, information due to the molecular sequence data has
reached saturation, and uncertainty in time estimate is
predominantly due to uncertainties in fossil calibrations.

4.2 Selection of Fossil Calibrations


Fossil calibrations are one of the most important pieces of informa- 
tion needed to perform divergence time estimation and thus should 
be chosen after careful consideration of the fossil record, although 
this may involve some subjectivity [29]. Parham et al. [30] discuss 
best practice for construction of fossil calibrations. For example, 
minimum bounds on node ages are normally set to be the age of the 
oldest fossil member of the crown group. A small probability (say 
2.5%) should be set for the probability that the node age violates 
the minimum bound (e.g., to guard against misidentified or incor- 
rectly dated fossils). Specifying maximum bounds is more difficult, 
as absence of fossils for a given geological period is not evidence 
that the clade in question was absent during the period [31]. Cur- 
rent practice is to set the maximum bound to a reasonable value 
according to the expertise of the paleontologist (see ref. 29 for 
examples), although a large probability (say 10% or even 20%) 
may be required to guard against badly specified maximum bounds. 
Calibration densities based on statistical modeling of species 
diversification, fossil preservation, and discovery are also possible 
[15]. In so-called tip-dating approaches, fossil species are included 
as taxa in the analysis (which may or may not include morphological 
information for the fossil and extant taxa) [37-39]. Thus, in 
tip-dating, explicit specification of a fossil calibration density for a 
node age is not necessary. 


4.3 Construction of the Time Prior 


The birth-death process with species sampling was used here to 
construct the time prior for nodes in the phylogeny for which fossil 
calibrations are not available. Varying the birth (λ), death (μ), and 
sampling (ρ) parameters can result in substantially different time 
priors. For example, using λ = μ = 1 and ρ = 0 leads to a uniform 
distribution prior on node ages. This diffuse prior appears 
appropriate for most analyses. Varying the values of λ, μ, and ρ is useful to 
assess whether the time estimates are robust to the time prior. 
Parameter configurations can be set up to generate time densities 
that result in young node ages or in very old node ages (see p. 381 in 
[20] for examples). 


4.4 Selection of the Clock Model 


In the analysis of closely related species (such as the apes), the clock 
assumption appears to be appropriate for time estimation. A 
likelihood ratio test can be used to determine whether the strict clock is 
appropriate for a given dataset [40]. If the clock is rejected, then 
Bayesian molecular clock dating should proceed using one of the 
various relaxed-clock models available [7, 13]. In this case, Bayesian 
model selection may be used to choose the most appropriate 
relaxed-clock model [41], although the method is computationally 
expensive and thus only applicable to small datasets. The use of 
different relaxed-clock models (such as the autocorrelated vs. the 
independent log-normally distributed rates) may result in 
substantially different time estimates (see ref. 32 for an example). In such 
cases, repeating the analysis under the different clock models may 
be desirable. 


5 Exercises 


5.1 Autocorrelated Rate Model 


Modify file mcmc/mcmctree.ctl and set clock = 3. This activates 
the autocorrelated log-normal rates model, also known as the 
geometric Brownian motion rates model [6, 13]. Run the 
MCMC twice and check for convergence. Compare the posterior 
times obtained with those obtained under the independent 
log-normal model (clock = 2). Are there any systematic differences 
in node age estimates between the two analyses? Which clock model 
produces the most precise (i.e., narrower CIs) divergence time 
estimates? 


5.2 MCMC Sampling with Exact Likelihood Calculation 


Modify file mcmc/mcmctree.ctl and set clock = 2 (independent 
rates), usedata = 1 (exact likelihood), burnin = 200, sampfreq = 2, 
and nsample = 500. These last three options will lead to a much 
shorter MCMC chain, with a total of 1200 iterations. Run the 
MCMC sampling twice, and check for convergence using the 
ESS, histograms, and trace plots. How long does it take for the 
sampling to complete? Can you estimate how long it would take to 
run the analysis using 2,020,000 iterations, as long as for the 
approximate method of Sect. 3.3.2? Did the two chains converge 
despite the low number of iterations? 


5.3 Change of Fossil Calibrations 


There is some controversy over whether †Sahelanthropus, used to 
set the minimum bound for the human-chimp divergence, is 
indeed part of the human lineage. The next (younger) fossil in 
the human lineage is †Orrorin, which dates to around 6 Ma. Modify 
file data/10s.tree and change the calibration on the 
human-chimp node to B(0.057, 0.10, 0.01, 0.2). Also change the 
calibration on the root node to B(0.615, 1.315, 0.01, 0.05). Run the 
MCMC analysis with the approximate method and again sampling 
from the prior. Are there any substantial differences in the posterior 
distributions of times under the new fossil calibrations? Which 
nodes are affected? How bad is the truncation effect among the 
calibration densities and the prior? 


5.4 Comparing Calibration Densities and Prior Densities 


This is a difficult exercise. Use R to plot the prior densities of times 
sampled using MCMC (the same as in Fig. 10). Now try to work 
out how to overlay the calibration densities onto the plots. For 
example, see Fig. 3 in [23] for an idea. First, write functions that 
calculate the calibration densities. The dunif function in R is useful 
to plot uniform calibrations. Functions sn::dsn and sn::dst 
(in the sn package) are useful to plot the skew-normal (SN) and 
skew-t (ST) distributions, respectively. Calibration type S2N (Table 1) is a 
mixture of two skew-normal distributions [15]. How do the 
sampled priors compare to the calibration densities? Are there any 
substantial truncation effects? 


5.5 Time Estimation in a Supermatrix of 330 Species 


Good taxon sampling is critical to obtaining robust estimates of 
divergence times for clades. In the data/ directory, an alignment 
of the first and second codon positions from mitochondrial 
protein-coding genes from 330 species (326 primate and 
4 out-group species) is provided, 330s.phys, with corresponding 
tree topology, 330s.tree. First, place the fossil calibrations of 
Table 1 on the appropriate nodes of the species tree. Then obtain 
the gradient and Hessian matrix for the 330-species alignment 
using the HKY + G model. Finally, estimate the divergence times 
on the 330-species phylogeny by using the approximate likelihood 
method. How does taxon sampling affect node age estimates when 
comparing the 10-species and 330-species trees? How does 
uncertainty in node ages in the large tree, which was estimated on a 
short alignment, compare with the estimates on the small tree, which 
used a large alignment? 
Chapter 11 


Genome Evolution in Outcrossing vs. Selfing vs. Asexual Species 


Sylvain Glémin, Clémentine M. François, and Nicolas Galtier 


Abstract 


A major current molecular evolution challenge is to link comparative genomic patterns to species’ biology 
and ecology. Breeding systems are pivotal because they affect many population genetic processes and thus 
genome evolution. We review theoretical predictions and empirical evidence about molecular evolutionary 
processes under three distinct breeding systems—outcrossing, selfing, and asexuality. Breeding systems may 
have a profound impact on genome evolution, including molecular evolutionary rates, base composition, 
genomic conflict, and possibly genome size. We present and discuss the similarities and differences between 
the effects of selfing and clonality. Conversely, comparative and population genomic data and approaches 
help revisit old questions on the long-term evolution of breeding systems. 


Key words Breeding systems, GC-biased gene conversion, Genome evolution, Genomic conflicts, 
Selection, Transposable elements 


1 Introduction 


In-depth investigations on genome organization and evolution are 
increasing and have revealed marked contrasts between species, 
e.g., evolutionary rates, nucleotide composition, and gene 
repertoires. However, little is yet known about how to link this “genomic 
diversity” to the diversity of life history traits or ecological forms. 
Synthesizing previous works in a provocative and exciting book, 
M. Lynch asserts that variations in fundamental population genetic 
processes are essential for explaining the diversity of genome 
architectures while emphasizing the role of the effective population size 
(Ne) and nonadaptive processes [1]. Life history and ecological 
traits may influence population genetic parameters, including Ne, 
making it possible to link species’ biology and their genomic 
organization and evolution (e.g., [2-7]). 

Fig. 1 Reproduction and genotype transmission in outcrossing, selfing, and 
asexual species. In outcrossers, parental and recombinant (dotted lines) 
gametes from distinct zygotes are shuffled at generation n + 1. In selfers, only 
gametes produced by a given zygote can mate, which quickly increases 
homozygosity and reduces the recombination efficacy. Asexuals do not undergo 
meiosis or syngamy. They reproduce clonally 


Among life history traits affecting population genetic 
processes, breeding systems are pivotal as they determine the way 
genes are transmitted to the next generation (Fig. 1). Outcrossing 
sexual species (outcrossers) reproduce through the alternation of 
syngamy (from haploid to diploid) and meiosis (from diploid to 
haploid), with random mating of gametes from distinct individuals 
at each generation. Outcrossing is a common breeding system that 
is predominant in vertebrates, arthropods, and many plants, 
especially perennials [8, 9]. Selfing species (selfers) also undergo 
meiosis, but fertilization only occurs between gametes produced by 
the same hermaphrodite individual. Consequently, diploid 
individuals from selfing species are highly homozygous (FIS ~ 1; see, for 
instance, ref. 10): heterozygosity is halved at each generation, 
and the two gene copies carried by an individual have a high 
probability of being identical by descent. Selfing is common in 
various plant families (e.g., Arabidopsis thaliana), mollusks, 
nematodes (e.g., Caenorhabditis elegans), and platyhelminthes, 
among others [8, 9]. Note that many sexual species have interme- 
diate systems in which inbreeding and outbreeding coexist. In 
organisms with a prolonged haploid phase (such as mosses, ferns, 
or many algae and fungi), a more extreme form of selfing can occur 
by taking place during the haploid phase (haploid selfing or intra- 
gametophytic selfing), leading instantaneously to genome-wide 
homozygosity [11]. Clonal asexual species, finally, only reproduce 
via mitosis, so that daughters are genetically identical to mothers 
unless a mutation occurs. In diploid asexuals, homologous chro- 
mosomes associated in a given zygote do not segregate in distinct 
gametes—they are co-transmitted to the next generation in the 
absence of any haploid phase. In contrast to selfing species, 
individuals from asexual diploid species tend to be highly heterozygous 
(FIS ~ −1 [12]), since any new mutation will remain in the 
heterozygous state forever, unless the same mutation occurs in 
the homologous chromosome. Clonality is documented in insects 
(e.g., aphids), crustaceans (e.g., Daphnia), mollusks, vertebrates, 
and angiosperms, among others [13-16]. As for selfing, clonality 
can also be partial, with sexual reproduction occurring in addition 
to or in alternation with asexual reproduction. In addition to this 
common form of asexuality, other forms such as automixis imply 
a modified meiosis in females where unfertilized diploid eggs pro- 
duce offspring potentially diverse and distinct from their mother, 
leading to different levels of heterozygosity [13]. This diversity of 
reproductive systems should be kept in mind, but for clarity we will 
mainly compare outcrossing, diploid selfing, and clonality. 
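The halving of heterozygosity under selfing described above can be made concrete with a short deterministic recursion. The sketch below is illustrative only: it assumes an infinite population with a constant selfing rate, no mutation, and no selection, and the function names are hypothetical. Selfed offspring inherit half the heterozygosity of their parents, while outcrossed offspring are formed by random union of gametes.

```python
def heterozygosity_trajectory(p, selfing_rate, generations):
    """Expected heterozygosity at a biallelic locus over time.

    p is the (constant) allele frequency. The population starts at
    Hardy-Weinberg proportions; each generation a fraction selfing_rate of
    offspring come from selfing (heterozygosity halved) and the rest from
    random mating (heterozygosity 2p(1 - p)).
    """
    hw = 2.0 * p * (1.0 - p)
    h = hw
    trajectory = [h]
    for _ in range(generations):
        h = selfing_rate * h / 2.0 + (1.0 - selfing_rate) * hw
        trajectory.append(h)
    return trajectory


def inbreeding_coefficient(heterozygosity, p):
    """FIS as the relative deficit of heterozygotes compared with 2p(1 - p)."""
    return 1.0 - heterozygosity / (2.0 * p * (1.0 - p))


complete = heterozygosity_trajectory(0.5, 1.0, 3)
partial = heterozygosity_trajectory(0.5, 0.5, 60)
print(complete)
print(round(inbreeding_coefficient(partial[-1], 0.5), 4))
```

Under complete selfing the trajectory is halved at every step, so FIS approaches 1 within a few generations. Under partial selfing the recursion settles at an intermediate FIS.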

Through the occurrence, or not, of syngamy, recombination, 
and segregation, breeding systems affect population genetic para- 
meters (effective population size, recombination rate, efficacy of 
natural selection; Fig. 2) and thus, potentially, genomic patterns. A 
large corpus of population genetic theory has been developed to 
study the causes and consequences of the evolution of breeding 
systems (Table 1). Thanks to the exponentially growing amount of 
genomic data, and especially data from closely related species with 
contrasted breeding systems, it is now possible to test these theo- 
retical predictions. Conversely, genomic data may help in under- 
standing the evolution of breeding systems. Genomes should 
record the footprints of transitions in breeding systems and help 
in testing the theory of breeding system evolution in the long run, 
e.g., the “dead-end hypothesis,” which posits that selfers and asex- 
uals are doomed to extinction because of their inefficient selection 
and low adaptive potential [17, 18]. Since the first edition of this 
book, several theoretical developments have clarified the 
population genetic consequences of the different breeding systems, and 
empirical evidence has been accumulating, partly changing our 
view of breeding system evolution and its consequences, especially for 
asexual organisms. We first review and update the consequences of 
breeding systems on genome evolution and then discuss and 
re-evaluate how evolutionary genomics sheds new light on the old 
question of breeding system evolution. 


Fig. 2 A schematic representation of the effect of breeding systems on population genetic parameters 


Table 1 
Summary of the major theoretical predictions regarding breeding systems and evolutionary genomic 
variables, with outcrossing being taken as reference 

              FIS   πS   dN/dS   Codon usage   TE        LD    GC-content 
Outcrossing   ~0    +    +       +             +         +     + 
Selfing       ~1    −    ++      =             Unclear   ++    = 
Asexuality    ~−1   =    +++     =             Unclear   +++   = 

TE transposable element abundance, LD linkage disequilibrium 


2 Contrasted Genomic Consequences of Breeding Systems 


2.1 Consequences of Breeding Systems on Population Genetics Parameters 


Sex involves an alternation of syngamy and meiosis. In outcrossing 
sexual species, random mating allows alleles to spread across popu- 
lations, while segregation and recombination (here in the sense of 
crossing-over) associated with meiosis generate new genotypic and 
haplotypic combinations. This strongly contrasts with the case of 
selfing and asexual species. In such species, alleles cannot spread 
beyond the lineage they originated from because mating occurs 
within the same lineage (selfers) or because syngamy is suppressed 
(asexuals). Second, recombination is not effective in 
non-outcrossers. In selfers, while physical recombination does 
occur (r_0), effective recombination (r_e) is reduced because it mainly 
occurs between homozygous sites, and it completely vanishes 
under complete selfing: for tight linkage, r_e = r_0(1 − FIS), where 
FIS is Wright’s fixation index [19], whereas for looser linkage, 
effective recombination is more reduced than predicted by this 
simple expression [20-22]. In asexuals, physical recombination is 
suppressed (r_0 = r_e = 0). High levels of linkage disequilibrium 
(nonrandom association of alleles between loci) could therefore 
be expected in selfers and asexuals. The observed data are mainly 
consistent with these predictions. In the selfing model species 
Arabidopsis thaliana, LD extends over a few hundreds of kb, 
while in maize, an outcrosser, LD quickly vanishes beyond a few 
kb [23]. In a meta-analysis, Glémin et al. [24] also found higher LD 
levels in selfers than in outcrossers. Beyond pairwise LD, selfing 
also generates higher-order associations, such as identity 
disequilibria (the excess probability of being homozygous at several loci 
[25]), that alter population genetic functioning compared to 
outcrossing populations (e.g., [26]). 
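Pairwise linkage disequilibrium, as compared between selfers and outcrossers above, is commonly summarized by the coefficient D and its normalized form r^2. The following sketch computes both from counts of the four haplotypes at two biallelic loci. The function name and the counts are hypothetical, and phased haplotype data are assumed.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r2) from counts of the four two-locus haplotypes.

    D = f(AB) - p(A) * p(B); r2 = D^2 / (pA (1 - pA) pB (1 - pB)).
    """
    total = n_AB + n_Ab + n_aB + n_ab
    f_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = f_AB - p_A * p_B
    denom = p_A * (1.0 - p_A) * p_B * (1.0 - p_B)
    if denom == 0.0:
        raise ValueError("both loci must be polymorphic")
    return d, d * d / denom


# Hypothetical samples: complete association vs. random association.
print(linkage_disequilibrium(50, 0, 0, 50))
print(linkage_disequilibrium(25, 25, 25, 25))
```

With only two haplotypes present, the alleles at the two loci are fully associated and r^2 equals 1. When all four haplotypes occur at the product of the allele frequencies, D and r^2 are both 0.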

Theory also predicts that the effective population size, Ne, 
depends on the breeding system (Fig. 2). First, compared to 
outcrossing, selfing is expected to directly lower Ne by a factor 1 + FIS 
by reducing the number of independent gametes sampled for 
reproduction [27]. From a coalescent point of view, selfing reduces 
coalescent time (again by the same factor 1 + FIS). Under 
outcrossing, two gene copies found in the same individual either 
directly coalesce or move apart at the preceding generation. Selfing 
prolongs the time spent within an individual, hence the probability 
of coalescing [19, 28]. In diploid asexuals, the picture is less 
obvious. Since genotypes, not alleles, are sampled, Balloux et al. [12] 
distinguished between the genotypic and allelic effective size. The 
genotypic effective size equals N, not 2N, i.e., the actual population 
size, similarly to the expectation under complete selfing. On the 
contrary, the allelic effective size tends toward infinity under 
complete clonality because genetic diversity within individuals cannot 
be lost [12]. This corresponds to preventing coalescence as long as 
gene copies are transmitted clonally [29, 30]. However, a very low 
level of sex (higher than 1/2N) is sufficient to retrieve standard 
outcrossing coalescent behavior [29, 30], and as far as natural 
selection is concerned (see below), the genotypic effective size is 
what matters [31]. The ecology of selfers and asexuals may also 
contribute to decreasing Ne, as they supposedly experience more 
severe bottlenecks than outcrossers [32, 33]. On the contrary, 
higher population subdivision in selfers could contribute to 
increasing Ne at the species scale. However, Ingvarsson [34] showed that, 
under most conditions, the extinction/recolonization dynamics is 
predicted to decrease Ne in selfers, at both the local and 
metapopulation scale. Finally, because of low or null effective 
recombination, hitchhiking effects (the indirect effects of selection at a locus 
on other linked loci) reduce Ne further [35]. Under complete 
selfing or clonality, because of full genetic linkage, selection at a 
given locus affects the whole genome. Most forms of selection, and 
especially directional selection, reduce the number of gene copies 
contributing to the next generation by removing deleterious alleles 
to the benefit of advantageous ones. Because of linkage, such a 
reduction spreads over the rest of the genome, globally reducing 
the effective population size (sensu lato) in non-outcrossing 
species. Background selection, the reduction in Ne due to the removal 
of deleterious mutations at linked loci, can be particularly severe in 
highly selfing and clonal populations, potentially reducing Ne by one 
order of magnitude or more [22, 36]. This effect is expected to 
be stronger in asexuals than in selfers [36]. In the predominantly 
selfing nematode C. elegans, nucleotide diversity has been shown to 
be reduced genome wide by both background selection [37] and 
selective sweeps [38], and in a comparative analysis, the effect of 
linked selection has been shown to be more pronounced in selfing than 
in outcrossing species [39]. 
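The two direct effects of selfing given in this section, the reduction of Ne by a factor 1 + FIS and the tight-linkage expression for effective recombination, can be combined in a few lines. This is a minimal sketch with a hypothetical function name and illustrative parameter values. It deliberately ignores bottlenecks, population structure, and hitchhiking, which the text identifies as further causes of reduced Ne.

```python
def direct_selfing_effects(n, r0, f_is):
    """Direct effects of inbreeding on effective size and recombination.

    n    : census number of diploid individuals
    r0   : physical recombination rate between two tightly linked loci
    f_is : Wright's fixation index, between 0 (outcrossing) and 1 (selfing)

    Returns (Ne, r_e) with Ne = n / (1 + f_is) and r_e = r0 * (1 - f_is).
    """
    if not 0.0 <= f_is <= 1.0:
        raise ValueError("f_is must lie between 0 and 1 for this sketch")
    ne = n / (1.0 + f_is)
    re = r0 * (1.0 - f_is)
    return ne, re


for f_is in (0.0, 0.5, 0.9, 1.0):
    ne, re = direct_selfing_effects(10000, 1e-8, f_is)
    print(f"FIS = {f_is:.1f}: Ne = {ne:.0f}, r_e = {re:.2e}")
```

Under complete selfing this direct effect only halves Ne, whereas effective recombination drops to zero, which is why linked selection is expected to dominate the overall reduction in Ne.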

As genetic diversity scales positively with Neμ, where μ is the 
mutation rate, selfers are expected to be less polymorphic than 
outcrossers. Asexuals should also exhibit lower genotypic diversity, 
but the prediction is not clear for allelic diversity (see above). 
However, because of the lack of recombination, haplotype diversity 
should be lower for both breeding systems. The effect of selfing on 
the polymorphism level is well documented, and empirical data 
mainly agree with the theoretical predictions. Selfing species tend 
to be more structured, less diverse, and, as expected, more 
homozygous than outcrossers [6, 24, 40, 41]. Far fewer data 
exist regarding diversity levels in asexuals, but the available datasets 
confirm that genotypic diversity, at least, is usually low in such 
species (see discussion in ref. 12). At the population level, a recent 
comparative analysis of sexual and asexual Aptinothrips rufus grass 
thrips confirmed the expected lower nuclear genetic diversity of 
asexual populations while also showing that some asexuals with 
extensive migration can feature very high mitochondrial genetic 
diversity [42]. 

These predictions concerning polymorphism patterns implic- 
itly assumed that mutation rates are the same among species with 
contrasted breeding systems. However, modifications in breeding 
systems can also affect various aspects of the species life cycle 
potentially related to the mutation rate. In asexuals, for instance, 
loss of spermatogenesis can reduce mutation rates, while loss of the 
dormant sexual phase can increase them (reviewed in [43]). Muta- 
tion rates can also be decreased in non-outcrossers due to the loss of 
recombination, which can be mutagenic [44, 45]. In selfers, 
meiosis and physical recombination do occur. However, some mutagenic 
processes acting during meiosis depend on the level of 
heterozygosity, such as indel-associated mutation (IDAM): heterozygous 
indels could increase the point mutation rate at nearby nucleotides 
because of errors during meiosis [46, 47]. Consistent with this 
prediction, the IDAM process more strongly affects the outcrossing 
wild rice, Oryza rufipogon, than the very recent selfer and weakly 
heterozygous domesticated rice, O. sativa. A. thaliana, a more 
ancient and mostly homozygous selfer, is very weakly affected by 
IDAM [48]. Overall, these processes should globally contribute to 
lowering mutation rates, and thus polymorphism, in selfing and 
asexual species. 


[Fig. 3 plot: x-axis is the selection coefficient s, from −0.0002 to 0.0002; legend entries include outcrossing (h = 0.5), selfing (any h), and clonality (h = 0.7)] 

Fig. 3 Substitution rates relative to the neutral case (dN/dS) in outcrossers (thin lines), selfers (bold line), and 
asexuals (dotted lines) for different mutation dominance levels. The fitness of the resident, heterozygote, and 
homozygote mutant genotypes are 1, 1 − hs, and 1 − s, respectively. For asexuals, it is necessary to consider 
two substitution rates corresponding to the initial fixation of heterozygotes and the ultimate fixation of 
complete homozygote mutants from an initially heterozygote population [31]. Population size: N = 10,000. 
To highlight the difference between selfers and asexuals due to segregation, demographic and hitchhiking 
effects reducing Ne in asexuals and selfers are not taken into account 


2.2 Breeding Systems and Selection Efficacy 


2.2.1 Drift and Recombination: Parallel Reduction in Selection Efficacy in Selfers and Asexuals 


The effective population size strongly affects the outcome of 
natural selection. The probability of fixation of a new mutation is a 
function of the product of Ne and s, where s is the selection coefficient 
([49] and see Fig. 3). As Ne is reduced, a higher proportion of 
mutations behave almost neutrally. Weakly deleterious alleles can 
thus be fixed, while weakly advantageous ones can be lost. Genetic 
associations among loci generated by selfing and clonality also 
induce selective interferences [26, 50]. Because of their reduced 
effective population size and recombination rate, selection is thus 
expected to be globally less effective in selfers and asexuals than in 
outcrossers, which should result in various footprints at the 
molecular level (Table 1). Assuming that most mutations are deleterious 
(with possible back compensatory mutations), both the ratio of 
non-synonymous to synonymous polymorphism, πN/πS, and the 
ratio of non-synonymous to synonymous substitutions, dN/dS, are 
predicted to be higher in selfers and asexuals than in outcrossers. 
Codon usage should also be less optimized in selfers and asexuals 
than in outcrossers. 

Contrary to polymorphism surveys, few studies have tested 
these predictions empirically (Table 2). In the few available 
comparative studies, contrasting patterns were observed between 
selfers and asexuals. Compared to sexual ancestors, recent asexual 
lineages show a marked increase in the dN/dS ratio in Daphnia 
([51] but see below), Timema stick insects [52], gastropods Cam- 
peloma [53] and Potamopyrgus [54], and the plant Boechera [55], in 
agreement with theoretical predictions (Table 2). However, no 
significant effect of asexuality on dN/dS was found in four aphid 
species [56] and in the plant Ranunculus auricomus [57]. Bdelloid 
rotifers, long considered as ancient asexuals (see below), exhibit a 
higher πN/πS ratio but not a higher dN/dS ratio than comparable 
sexual groups, suggesting that mildly deleterious mutations can 
segregate at a higher frequency in asexuals but are eventually 
removed. A higher πN/πS ratio in asexual lineages than in sexual 
relatives was reported from transcriptome data in Oenothera 
primroses [58] and Lineus nemerteans [59]. Note, however, that in the 
latter case, the increased πN/πS is primarily explained by the hybrid 
nature of the asexual Lineus pseudolacteus (Table 2). The recent 
origin of asexuality through introgression also challenges the inter- 
pretation of elevated dN/dS ratio in the mitochondrial genome of 
asexual lineages of Daphnia pulex [51], as less than 1% of mutations 
on the branches leading to asexual lineages would have arisen after 
the transition to asexuality [60]. Here, rather than being the direct 
cause of genomic degradation, asexuality may have evolved in 
already-degraded lineages. 
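The dependence of fixation on the product of Ne and s, which underlies the dN/dS predictions discussed in this section, can be illustrated with Kimura's diffusion approximation. The sketch below is a simplified illustration and not the model behind Fig. 3: it assumes additive (genic) selection in a randomly mating diploid population, takes s as the selective effect in heterozygotes, and uses a hypothetical function name.

```python
import math


def relative_fixation_rate(s, ne, n=None):
    """Fixation probability of a new mutation relative to a neutral one.

    Uses u = (1 - exp(-2 s Ne / N)) / (1 - exp(-4 Ne s)) for a mutation
    starting as a single copy among 2N, divided by the neutral value 1/(2N).
    If n is omitted, the census size is taken equal to ne.
    """
    n = ne if n is None else n
    if s == 0.0:
        return 1.0
    u = math.expm1(-2.0 * s * ne / n) / math.expm1(-4.0 * ne * s)
    return 2.0 * n * u


# A mildly deleterious mutation (s = -1e-4) in populations of decreasing Ne.
for ne in (100000, 10000, 1000):
    print(ne, round(relative_fixation_rate(-1e-4, ne), 4))
```

As Ne decreases, the relative fixation rate of the deleterious mutation moves toward the neutral value of 1, which is the sense in which a higher proportion of mutations behave almost neutrally in small populations.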

Not all predictions are equally supported by data in selfers. 
Polymorphism-based measures mostly support a reduction in 
selection efficacy in selfers in various plant species, and this was 
recently confirmed by a meta-analysis of genome-wide 
polymorphism data ([6] and see Table 2). On the contrary, as far as dN/dS 
or base composition are compared, most studies, in plants, fungi, 
and animals, did not find evidence of relaxed selection in selfers 
(Table 2). A recent origin of selfing is often invoked to explain why 
an effect of selfing is rarely observed at the divergence level (e.g., 
[61-64]), whereas a recent transition to selfing can leave a 
clear signature of relaxed selection at the polymorphism level 
[65]. In contrast, in the freshwater snail Galba truncatula where 
selfing is supposed to be old and ancestral to a clade of several 
species, relaxed selection in the selfing lineage was also observed 
at the divergence level [66]. The same rationale should apply to 
asexual species. However, in Campeloma, Potamopyrgus, Timema, 
and Boechera, clonality is also recent, yet the expected patterns are 
observed at the divergence level. The reduction in Ne could simply 
be less severe in selfers than in asexuals as predicted by background 
selection models [36]. Furthermore, complete selfing is hardly ever 
noted in natural populations; residual outcrossing typically occurs. 
Among hitchhiking effects, some are very sensitive to the 
recombination level, such as Muller’s ratchet [67], weak Hill-Robertson 
interferences [50], or hitchhiking of deleterious mutations during 
selective sweeps [68, 69]. If such mechanisms are the main cause of 
reduction of Ne in selfers, then even a low recombination rate could 
be enough to maintain the selection efficacy. This is suggested by 
genomic patterns across recombination gradients in outcrossing 
species. In primates, no effect of recombination on the selection 
efficacy has been detected [70]. In Drosophila, Haddrill et al. [71] 
found little evidence of reduced selection in low recombining 
regions, except when recombination was fully suppressed, as in Y 
chromosomes. Differences between selfers and asexuals could thus 
simply result from different degrees of residual outcrossing. How- 
ever, as stated above, selfers and asexuals also fundamentally differ 
as far as segregation is concerned, as we now discuss in more detail. 


Selfing affects the selection efficacy by increasing homozygosity and 
thus exposing recessive alleles to selection. This effect can 
counteract the effect of reducing Ne. Considering the sole reduction in Ne 
due to non-independent gamete sampling, selection is less efficient 
under partial selfing for dominant mutations but more efficient for 
recessive ones (Fig. 3, and see ref. 72). More precisely, Glémin [73] 
determined the additional reduction in Ne (due to hitchhiking and 
demographic effects) necessary to overcome the increased selection 
efficacy due to homozygosity. This additional reduction can be high 
for recessive mutations. On the contrary, the lack of segregation in 
asexuals reduces selection efficacy and increases the drift load, as 
heterozygotes can fix [31]. The effects of selfing and clonality on 
the fixation probability of codominant, recessive, or dominant 
mutations are summarized in Fig. 3. Note that segregation may 
also have indirect effects. When recombination is suppressed, 
Muller’s ratchet is supposed to reduce Ne and contribute to the fixation 
of weakly deleterious alleles [74]. In selfers, the purging of partially 
recessive deleterious alleles slows down the ratchet [67], which 
suggests that the fixation of deleterious alleles at linked loci would 
be lower in selfers than in asexuals. The same mechanism also 
contributes to weaker background selection in selfers than in 
asexuals (see above, [36]). In the extreme case of intra-gametophytic 
selfing, purging could be even more efficient at removing 
deleterious alleles [11], as has been suggested for moss species [75]. 
Segregation at meiosis could thus partly explain the differences 
between selfers and asexuals, but more data are clearly needed to 
confirm this hypothesis. 

The two opposite effects of drift and segregation in selfers 
should also affect adaptive evolution. In outcrossers, new beneficial 
mutations are more likely to be rapidly lost if recessive, as they are 
initially present in heterozygotes and masked to selection—a pro- 
cess known as Haldane’s sieve [76]. By unmasking these mutations 
in homozygotes, selfing could help adaptive evolution from 
recessive mutations [72, 73]. However, this advantage of selfing 
disappears when adaptation proceeds from pre-existing variation because 
homozygotes can also be present in outcrossers [77]. Selective 
interference in selfers also reduces their advantage of not 
experiencing Haldane’s sieve, especially for weakly beneficial mutations 
[21], and the effect of background should globally reduce the 
rate of adaptation [73, 77, 78]. Conversely, the lack of segregation 
in asexuals delays the complete fixation of an advantageous 
mutation. Once a new advantageous mutation gets fixed in the 
heterozygous state, additional lag time until occurrence and fixation of a 
second mutation is necessary to ensure fixation [79]. Little is 
known about the dominance levels of new adaptive mutations, 
but a survey of QTL fixed during the domestication process in 
several plant species confirmed the absence of Haldane’s sieve in 
selfers compared to outcrossers [80]. This mostly corresponds to 
strong selection on new mutations or mutations in low initial 
frequencies in the wild populations. More generally, the effect of 
selfing on adaptive evolution will depend on the distribution of 
dominance and selective effects of mutations and the magnitude of 
genetic drift and linkage. 
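
The contrast between Haldane's sieve and the drift cost of selfing can be made concrete with a minimal sketch. It assumes the classical single-locus approximation for the fixation probability of a new beneficial mutation under partial selfing, u ≈ 2s(h + F - hF)/(1 + F), with F = σ/(2 - σ) for a selfing rate σ. This formula is not given in the text above; it ignores linkage, background selection, and selective interference, and the function names and parameter values are purely illustrative.

```python
def inbreeding_coefficient(selfing_rate: float) -> float:
    """Equilibrium inbreeding coefficient F = sigma / (2 - sigma)."""
    return selfing_rate / (2.0 - selfing_rate)


def fixation_probability(s: float, h: float, selfing_rate: float) -> float:
    """Approximate fixation probability of a new beneficial mutation.

    Assumption (not from the text): u ~ 2 * s * (h + F - h*F) / (1 + F).
    The term (h + F - h*F) captures the exposure of the mutation in
    homozygotes; the factor 1 / (1 + F) captures the reduction in
    effective population size caused by selfing alone.
    """
    F = inbreeding_coefficient(selfing_rate)
    return 2.0 * s * (h + F - h * F) / (1.0 + F)


if __name__ == "__main__":
    s = 0.01  # hypothetical selection coefficient
    for h in (0.1, 0.5, 0.9):  # recessive, additive, dominant
        u_out = fixation_probability(s, h, selfing_rate=0.0)
        u_self = fixation_probability(s, h, selfing_rate=1.0)
        print(f"h={h}: outcrossing u={u_out:.4f}, full selfing u={u_self:.4f}")
```

Under these assumptions, full selfing gives u ≈ s whatever the dominance level, whereas outcrossing gives u ≈ 2hs, so selfing is favorable only for mutations with h < 1/2. Linkage effects, which the sketch omits, would further lower the values for selfers.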

Few studies have tested for difference in positive selection 
between selfers and outcrossers. In their survey of sequence poly- 
morphism data in flowering plants, Glémin et al. [24] found, on 
average, more genes with a signature of positive selection in out- 
crossers than in selfers assessed by the McDonald-Kreitman test 
[81]. An extension of this method, in which non-synonymous
vs. synonymous polymorphism data are used to calibrate the
distribution of the deleterious effects of mutations and then
attribute the excess non-synonymous divergence observed to positive
selection [82], was applied to one plant [83] and one freshwater
snail dataset. In both studies, a large fraction of
non-synonymous substitutions was estimated to be adaptive in 
the outcrossing species (~40% in the plant Capsella grandiflora 
and ~55% in the snail Physa acuta), whereas this proportion was 
not significantly different from zero in the selfer (Arabidopsis thali- 
ana and Galba truncatula, respectively). Based on methods where 
the dN/dS ratio is allowed to vary both among branches and sites, a 
comparative analysis of two outcrossing and two selfing Triticeae 
species [84] suggested that adaptive substitutions may have specifi- 
cally occurred in the outcrossing lineages. This would contribute to 
explaining why selfing lineages did not show a higher dN/dS ratio 
than outcrossing ones (see above and Table 2). So the data available 
so far support an increased rate of adaptation in outcrossing species, 
suggesting that the effects of drift and linkage overwhelm the 
advantage of avoiding Haldane’s sieve. A similar approach in
Oenothera species also suggested reduced adaptive evolution in
clonal compared with sexual lineages [85].
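
The logic behind McDonald-Kreitman-type estimates of the proportion of adaptive substitutions can be sketched with the simplest estimator, α = 1 - (Ds·Pn)/(Dn·Ps). This is the basic form, not the extension that models the distribution of deleterious effects; it is known to be biased downward when slightly deleterious polymorphisms segregate. The function name and the counts below are invented for illustration and do not come from any of the cited datasets.

```python
def mk_alpha(dn: int, ds: int, pn: int, ps: int) -> float:
    """Simple McDonald-Kreitman estimate of the adaptive fraction.

    dn, ds: non-synonymous and synonymous divergence counts.
    pn, ps: non-synonymous and synonymous polymorphism counts.
    Returns alpha = 1 - (ds * pn) / (dn * ps).
    """
    if dn == 0 or ps == 0:
        raise ValueError("dn and ps must be non-zero to compute alpha")
    return 1.0 - (ds * pn) / (dn * ps)


if __name__ == "__main__":
    # Hypothetical counts: an excess of non-synonymous divergence
    # relative to polymorphism yields a positive alpha.
    print(mk_alpha(dn=50, ds=100, pn=30, ps=100))
    # Hypothetical counts with no excess: alpha is zero.
    print(mk_alpha(dn=30, ds=100, pn=30, ps=100))
```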

Finally, the classical assumption of a lack of segregation in
asexuals must be qualified. First, in some forms of asexuality,
such as automixis, female meiosis is retained, and diploidy
restoration occurs by fusion or duplication of female gametes. Depending
on how meiosis is altered, automixis generates a mix of highly 
heterozygous and highly homozygous regions along chromo- 
somes. The genomes of such species could thus exhibit a gradient 
of signatures of selfing and diploid clonal evolution [86]. Secondly, 
mitotic recombination and gene conversion in the germline of 
asexual lineages can also reduce heterozygosity at a local genomic 
scale. Mitotic recombination has been well documented in yeast (see 
review in ref. 87) and also occurs in the asexual trypanosome T. b. 
gambiense [88] and in asexual Daphnia lineages [60, 89, 90]. If its
frequency is of the order of, or higher than, mutation rates, as
reported in yeast and Daphnia, asexuals would not suffer much from the
lack of segregation at meiosis. In particular, during adaptation, the
lag time between the appearance of a first beneficial mutation and the
final fixation of a mutant homozygote could be strongly reduced
[87]. However, such mechanisms of loss of heterozygosity also
rapidly expose recessive deleterious alleles that were masked in
heterozygotes and generate inbreeding-depression-like effects [60].


So far, we have only considered the immediate, mechanistic effects 
of breeding systems on population genetic parameters. Breeding 
systems, however, can also affect the evolution of genetic systems 
themselves, which modulates previous predictions. Theoretical 
arguments suggested that selfing, even at small rates, greatly 
increases the parameter range under which recombination is 
selected for [91-93]. These predictions have been confirmed in a 
meta-analysis in angiosperms in which outcrossers exhibited lower 
chiasmata counts per bivalent than species with mixed or selfing 
mating systems [94]. Higher levels of physical recombination (r0)
could thus help break down LD and reduce hitchhiking effects. 
This could contribute to explaining why little evidence of long- 
term genomic degradation has been observed in selfers, compared 
to asexuals. 

Breeding systems may also affect selection on mutation rates. 
Since the vast majority of mutations are deleterious, mutation rates
should tend toward zero, until the physiological cost of further
reducing them becomes too high (e.g., [95, 96]). Under
complete linkage, a modifier remains associated with its “own” 
mutated genome. Selection should thus favor lower mutation 
rates in asexuals and selfers (e.g., [95, 96]). However, Lynch 
recently challenged this view and suggested a lower limit to DNA 
repair may be set by random drift, not physiological cost [97]. Such 
a limit should thus be higher in asexuals and selfers. Asexuality is 
often associated with very efficient DNA repair systems (reviewed in 
[43]), supporting the view that selection for efficient repair may 
overwhelm drift in asexual lineages. Alternatively, only groups
having high-fidelity repair mechanisms could maintain asexuality in
the long run. More formal tests of mutation rate differences
between breeding systems are still scarce. The phylogenetic
approach revealed no difference in dS, as a proxy of the neutral
mutation rate, between A. thaliana and A. lyrata [61], nor did a
mutation accumulation experiment that compared the deleterious
genomic mutation rate between Amsinckia species with contrasted
mating systems [98]. A similar experiment in Caenorhabditis
showed that the rate of mutational decay was, on average, fourfold
greater in gonochoristic outcrossing taxa than in the selfer
C. elegans [99]. Recent mutation accumulation experiments on
Daphnia pulex suggested a slightly lower mutation rate in obligate
than in facultative asexual genotypes, except for one mutator
phenotype which evolved in an asexual subline [90]. Overall, these
results do not support Lynch’s hypothesis of mutation rates being
limited by drift in asexual and selfing species. However, such
experiments are still too scarce, and quantifying how mutation rates
vary or not with breeding systems is a challenging issue that requires
more genomic data.


2.3 Breeding Systems and Genomic Conflicts


Outcrossing species undergo various sorts of genetic conflict. Sexual
reproduction directly leads to conflicts within (e.g., for access to
mating) and between sexes (e.g., for resource allocations between
male and female functions or between offspring). In selfers and
asexuals, such conflicts are relaxed because mates are akin or because
mating is absent [100, 101]. Outcrossers are also sensitive to
epidemic selfish element proliferation and to meiotic drive, because
alleles can easily spread over the population through random mating.
In contrast, selfers and asexuals should be immune to such
genomic conflicts because selection only occurs between selfing or
asexual lineages so that selfish elements should be either lost or
evolve into commensalists or mutualists [102].


2.3.1 Relaxation of Sexual Conflicts in Selfers and Asexuals


Some genes involved in sexual reproduction are known to evolve
rapidly because of recurrent positive selection [103]. Arms races for
mating or for resource allocation to offspring are the most likely 
causes of this accelerated evolution. In selfers and asexuals, selec- 
tion should be specifically relaxed on these genes, not only because 
of low recombination and effective size but mainly because the 
selection pressure per se should be suppressed. In agreement with this
prediction, in the outcrosser C. grandiflora, 6 out of the 20 genes
that show the strongest departure from neutrality are reproductive 
genes and under positive selection. This contrasts with the selfer 
A. thaliana, for which no reproductive genes are under positive 
selection [83].

More specifically, two detailed analyses provided direct evidence
of relaxed selection associated with sexual conflict reduction.
In the predominantly selfer C. elegans, some males deposit a copu- 
latory plug that prevents multiple matings. However, other males 
do not deposit this plug. A single gene (plg-1), which encodes a 
major structural component of this plug, is responsible for this 
dimorphic reproductive trait [104]. Loss of the copulatory plug is 
caused by the insertion of a retrotransposon into an exon of plg-1. 
This same allele is present in many populations worldwide, suggest- 
ing a single origin. The strong reduction in male-male competition 
following hermaphroditism and selfing evolution explains why no
selective force opposes the spread of this loss-of-function allele
[104, 105]. In A. thaliana, similar relaxed selection has been 
documented in the MEDEA gene, an imprinted gene directly 
involved in the male vs. female conflict. MEDEA is expressed 
before fertilization in the embryo sac and after fertilization in the 
embryo and the endosperm, a tissue involved in nutrient transfer to 
the embryo. In A. lyrata, an outcrossing relative of A. thaliana,
MEDEA could be under positive [106] or balancing selection 
[107], in agreement with permanent conflicting pressures for 
resource acquisition into embryos between males and females. 
Conversely, this gene evolved under purifying selection in 
A. thaliana, where the level of conflict is reduced. 

Male vs. female diverging interests are also reflected by cyto- 
nuclear conflicts. When cytoplasmic inheritance is uniparental, as in 
most species, cytoplasmic male sterility (CMS) alleles favoring 
transmission via females at the expense of males can spread in 
hermaphroditic outbreeding species, leaving room for coevolution 
with nuclear restorers. Maintenance of CMS/non-CMS polymor- 
phism leads to stable gynodioecy [108]. In selfers, CMS mutants 
also reduce female fitness—because ovules cannot be fertilized— 
and are thus selected against. In the genus Silene, the mitochondrial 
genome of gynodioecious species exhibits molecular signatures of 
adaptive and/or balancing selection. This is likely due to
cyto-nuclear conflicts, as it is not, or is less often, observed in
hermaphroditic and dioecious species [109-111]. Although less studied,
cyto-nuclear conflicts are also expected in purely hermaphroditic
species. In a recent
study in A. lyrata, Foxe and Wright [112] found evidence of 
diversifying selection on members of a nuclear gene family encod- 
ing transcriptional regulators of cytoplasmic genes. Some of them 
show sequence similarity with CMS restorers in rice. Given the 
putative function of these genes, such selection could be due to 
ongoing cyto-nuclear coevolution. Interestingly, in A. thaliana, 
these genes do not seem to evolve under similar diversifying selec- 
tion, as expected in a selfing species where conflicts are reduced. 




2.3.2 Biased Gene Conversion as a Meiotic Drive Process: Consequences for Nucleotide Landscape and Protein Evolution


GC-biased gene conversion (gBGC) is a kind of meiotic drive at the 
base pair scale that can also be strongly influenced by breeding 
systems. In many species, gene conversion occurring during 
double-strand break recombination repair is biased toward G and 
C alleles (reviewed in [113]). This process mimics selection and can 
rapidly increase the GC content, especially around recombination 
hotspots [114, 115], and, more broadly, can affect genome-wide 
nucleotide landscapes. For instance, it is thought to be the main 
force that shaped the isochore structure of mammals and birds 
[116]. gBGC has been mostly studied by comparing genomic 
regions with different rates of (crossing-over) recombination 
(reviewed in [116]). However, comparing species with contrasted 
breeding systems offers a broader and unique opportunity to study 
gBGC. gBGC cannot occur in asexuals because recombination is 
lacking. Selfing is also expected to reduce the gBGC efficacy 
because meiotic drive does not occur in homozygotes [117]. To 
our knowledge, GC content has never been compared between 
sexual and asexual taxa, but there have been comparisons between 
outcrossers and selfers. 
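
The expected effect of selfing on gBGC can be illustrated with a minimal sketch. It assumes a standard two-allele model in which the equilibrium GC content is GC* = 1/(1 + λ·exp(-B)), where λ is the AT-biased mutation ratio and B = 4·Ne·b is the population-scaled gBGC intensity. It also assumes that selfing lowers both the conversion coefficient (drive acts only in heterozygotes, hence the factor 1 - F) and the effective size (factor 1/(1 + F)). These formulas and all parameter values are illustrative assumptions, not results quoted from the studies cited above.

```python
import math


def scaled_gbgc(pop_size: float, b: float, selfing_rate: float) -> float:
    """Population-scaled gBGC intensity B under partial selfing.

    Assumptions: F = sigma / (2 - sigma); drive only acts in
    heterozygotes, so b is multiplied by (1 - F); effective size is
    N / (1 + F). Hence B = 4 * N * b * (1 - F) / (1 + F).
    """
    F = selfing_rate / (2.0 - selfing_rate)
    return 4.0 * pop_size * b * (1.0 - F) / (1.0 + F)


def equilibrium_gc(mutation_bias: float, B: float) -> float:
    """Equilibrium GC content, GC* = 1 / (1 + lambda * exp(-B)).

    mutation_bias (lambda) is the ratio of GC->AT to AT->GC mutation
    rates; B = 0 recovers the mutational equilibrium 1 / (1 + lambda).
    """
    return 1.0 / (1.0 + mutation_bias * math.exp(-B))


if __name__ == "__main__":
    N, b, lam = 10_000, 2.5e-5, 2.0  # hypothetical values
    for sigma in (0.0, 0.5, 0.9, 1.0):
        B = scaled_gbgc(N, b, sigma)
        print(f"selfing={sigma}: B={B:.3f}, GC*={equilibrium_gc(lam, B):.3f}")
```

With these hypothetical values, GC* declines toward the mutational equilibrium as the selfing rate increases, which is the qualitative pattern the comparisons between outcrossers and selfers are designed to detect.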

As expected, no relationship was found between local recombi- 
nation rates and GC-content in the highly selfing Arabidopsis thali- 
ana [117], and Wright et al. [118] suggested that the (weak) 
differences observed with the outcrossing A. lyrata and Brassica 
oleracea could be due to gBGC. Much stronger evidence has been 
obtained in grasses. Grasses are known to exhibit unusual genomic 
base composition compared to other plants, being richer and more 
heterogeneous in GC-content [119], and direct and indirect evidence
of gBGC has been accumulating [119-122]. Accordingly, GC-content or
equilibrium GC values were found to be higher in outcrossing than in
selfing species [24, 84, 120]. Differences in gBGC between outcrossing
and selfing lineages have also been found in the plant genus Collinsia
[123] and in freshwater snails [66], although differences in selection
on codon usage cannot be completely ruled out.

gBGC can also affect functional sequence evolution, leaving a 
spurious signature of positive selection and increasing the mutation 
load through the fixation of weakly deleterious AT→GC mutations:
gBGC would represent a genomic Achilles’ heel
[124]. Once again, comparing outcrossing and selfing species is
useful for detecting interference between gBGC and selection. 
gBGC is expected to counteract selection in outcrossing species 
only. The Achilles’ heel hypothesis could explain why relaxed selec- 
tion was not detected in four grass species belonging to the Triti- 
ceae tribe [84]. In outcrossing species, but not in selfing ones, 
dN/dS was found to be significantly higher for genes exhibiting 
high than low equilibrium GC-content, suggesting that selection 
efficacy could be reduced because of high substitution rates in favor 
of GC alleles in these outcrossing grasses. In outcrossing species,
gBGC can maintain recessive deleterious mutations for a long time
at intermediate frequency, in a similar way to overdominance
[125]. This could generate high inbreeding depression in outcrossing
species, preventing the transition to selfing. Conversely, recurrent
selfing would reduce the load through both purging and the
avoidance of gBGC, thus reducing the deleterious effects of 
inbreeding. Under this scenario, gBGC would reinforce disruptive 
selection on mating systems. In the long term, gBGC could be a 
new cost of outcrossing: because of gBGC, not drift, outcrossing 
species could also accumulate weakly deleterious mutations, to an 
extent which could be substantial given current estimates of gBGC 
and deleterious mutation parameters [125]. Whether this gBGC- 
induced load could be higher than the drift load experienced by 
selfing species remains highly speculative. Both theoretical work,
to refine predictions, and empirical data, to quantify the strength of 
gBGC and its impact on functional genomic regions, are needed in 
the future. Grasses are clearly an ideal model for investigating these 
issues, but comparisons with groups having lower levels of gBGC 
would also be helpful. 


Considering the role of sex in the spread of selfish elements, TEs 
should be less frequent in selfers and asexuals than in outcrossers 
because they cannot spread from one genomic background to 
another through syngamy. However, highly selfing and asexual 
species derive from sexual outcrossing ancestors, from which they 
inherit their load of TEs. TE distribution eventually depends on the 
balance between additional transposition within selfing/clonal 
lineages on one hand and selection or excision on the other. Fol- 
lowing the abandonment of sex, large asexual populations are 
expected to purge their load of TEs, provided excision occurs, 
even at very low rates. However, purging can take a very long 
time, and, without excision, TEs should slowly accumulate, not 
decline [126]. In small populations, even with excision, a Muller’s 
ratchet-like process drives TE accumulation throughout the 
genome [126]. Transition from outcrossing to selfing should also 
rapidly purge TEs, but as for asexuals, in small fully selfing popula- 
tions, TEs can be retained [127]. Using yeast populations, it was 
experimentally confirmed that sex increases the spread of TEs 
[128, 129]. TE numbers were also found to be higher in cyclically 
sexual than in fully asexual populations of Daphnia pulex 
[130-132] (Table 3), contrary to what was described in the para- 
sitoid wasp Leptopilina clavipes and in root knot nematode species 
(Table 3). It should be noted that several comparative studies on 
asexual arthropods, nematodes, primroses, and green algae did not
reveal any significant effect of breeding system on TE content or
evolution (Table 3). At larger evolutionary scales, the putatively 
ancient asexual bdelloid rotifers strikingly exemplify the fact that
asexuals can purge their load of TEs. Unlike all sexual eukaryotes,
they appear to be free of vertically transmitted retrotransposons,
while their genome contains DNA transposons, probably acquired 
via horizontal transfers [133, 134]. Examples of TE accumulation 
in asexuals are less common, maybe because species are doomed to 
extinction under this evolutionary scenario [135]. However, the 
increase in genome size in some apomictic lineages of Hypericum 
species may result from this process [136]. 

In selfers, the distribution of TEs depends not only on the 
population size but also on the mode of selection against TEs 
[127, 137]. Under the “deleterious” model, TE insertions are 
selected against because they disrupt gene functions. According 
to the “ectopic exchange” model, TEs are selected against because 
they generate chromosomal rearrangements through unequal 
crossing-over between TEs at nonhomologous insertion sites.
Under the first of these two models, homozygosity resulting from 
selfing increases the selection efficacy against TEs, while under the 
second one, under-dominant chromosomal rearrangements are less 
selected against in selfing than in outcrossing populations 
[127, 137]. A survey of Ty1-copia-like elements in plants suggests
that they are less abundant in self-fertilizing than in outcrossing
plants, thus supporting the “deleterious” rather than the “ectopic
exchange” model [127]. The distribution of retrotransposons in
self-incompatible and self-compatible Solanum species also sup- 
ports the “deleterious” model, even though most insertions are 
probably neutral [138] (Table 3). In the selfer Arabidopsis
thaliana, selection efficacy against TEs seems to be reduced compared
to its outcrossing sister species A. lyrata [139, 140], but comparison
of the two complete genomes revealed a higher load of TEs in
A. lyrata and a recent decrease in TE number in A. thaliana, in
agreement with the date of transition to selfing [141]. In the
Capsella genus, while the very recent selfer C. rubella possesses a 
slightly higher number of TEs than the outcrossing C. grandiflora, 
the oldest selfer C. orientalis exhibits a significantly reduced load of
TEs [142] (Table 3). Other selfish elements, such as B chromosomes,
are also less frequent in selfers, in support of the view that
inbreeding generally prevents selfish element transmission [102]. 


Atypical breeding systems are often associated with polyploidy 
[143], but the reasons for this association are not entirely clear.
Polyploid mutants might be more likely to establish as new lineages 
in selfers and asexuals than in obligate outcrossers if crosses 
between polyploids and diploids are infertile or counterselected.
This is because at low population frequency a polyploid mutant will 
experience the disadvantage of mostly mating with diploids (the
minority cytotype exclusion principle [144, 145]) unless it
reproduces asexually or via selfing. In addition, by doubling gene copy
number, polyploidy might alleviate the fitness cost of recessive
deleterious mutations being exposed in the homozygous state in selfers
[146]. Kreiner et al. [147] reported that in Brassicaceae the rate of 
production of unreduced gametes is higher in asexuals than in 
outcrossers, suggesting that mating systems can influence not 
only the establishment rate but also the mutation rate to polyploidy. 
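
The minority cytotype exclusion argument can be illustrated with a deliberately simple sketch. It assumes that outcrossed ovules are fertilized at random with respect to cytotype, that all inter-cytotype crosses fail, and that selfed (or asexually produced) offspring need no mate. Under these assumptions, the fraction of a cytotype's reproduction that is successful is σ + (1 - σ)·p, where σ is its selfing (or asexual) rate and p its frequency. The functions and values below are hypothetical and are not taken from refs. 144 or 145.

```python
def compatible_mating_fraction(own_freq: float, selfing_rate: float) -> float:
    """Fraction of reproduction that avoids inter-cytotype crosses.

    Selfed or asexual offspring (fraction selfing_rate) always succeed;
    outcrossed ovules succeed only when the mate has the same cytotype,
    which happens with probability own_freq under random mating.
    """
    return selfing_rate + (1.0 - selfing_rate) * own_freq


def polyploid_relative_success(polyploid_freq: float, selfing_rate: float) -> float:
    """Reproductive success of polyploids relative to diploids."""
    polyploid = compatible_mating_fraction(polyploid_freq, selfing_rate)
    diploid = compatible_mating_fraction(1.0 - polyploid_freq, selfing_rate)
    return polyploid / diploid


if __name__ == "__main__":
    p = 0.01  # hypothetical frequency of a new polyploid mutant
    for sigma in (0.0, 0.5, 0.9, 1.0):
        w = polyploid_relative_success(p, sigma)
        print(f"selfing/asexual rate={sigma}: relative success={w:.3f}")
```

In this sketch, a rare polyploid suffers a strong disadvantage under obligate outcrossing, and the disadvantage vanishes as reproduction becomes fully selfing or asexual.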

Recent genome-wide data analyses have revealed that a number 
of polyploid selfers or asexuals actually correspond to allopolyploids 
(e.g., [59, 148-151]), highlighting the possibility that hybridiza- 
tion plays a role in breeding system and ploidy evolution. Hybridi- 
zation between facultative asexuals might cause immediate 
transition to obligate asexuality if the two progenitor genomes are 
so divergent that meiosis is impaired—e.g., due to chromosomal 
rearrangements, or in case of genetic incompatibilities affecting 
genes involved in sexual reproduction [16]. Numerous selfing or 
asexual lineages, either diploid or polyploid, are known to be of 
hybrid origin (e.g., [13, 152-157]). Hybridization would therefore 
appear as a potential cause, and polyploidy a potential consequence, 
of atypical breeding systems [16], but more genome-wide data are 
obviously needed to draw firm conclusions on these complex 
relationships. 


As argued above, breeding systems can affect many aspects of 
genome content and organization. They should also affect the 
overall genome size. Following Lynch’s theory [1], genome size
should be larger in selfers and asexuals because of their reduced
effective population size, hence reduced ability to get rid of useless, 
slightly costly sequences. However, the picture is probably more 
complex. First, because of the recent origin of many selfing and 
(at least some) asexual lineages, relaxed selection may not have 
operated long enough to impact genome size. Second, because
of their immunity to selfish element transmission, selfers and
asexuals should exhibit smaller genomes, especially in groups where
TEs are major determinants of genome size. Hence, it is not clear 
whether genetic drift or resistance to selfish elements (or other 
processes) is the most important in governing genome size evolu- 
tion in various breeding systems. 

Meta-analyses performed in plants provided equivocal answers. 
Analysis of the distribution of B chromosomes showed a strong and 
significant positive association between outcrossing, the occurrence 
of B chromosomes, and genome size [102, 158]. However, after 
phylogenetic control, only the association between breeding sys- 
tems and B chromosomes remains. Whitney et al. [159] simulta- 
neously tested the effect of breeding systems (using outcrossing 
rate estimates) and genetic drift (using polymorphism data) on 
genome size in seed plants. Raw data showed a significant effect 
of both breeding systems and genetic drift, according to theoretical 
predictions. However, no effect was observed after phylogenetic 
control, leading the authors to reconsider the hypothesis of a role
of nonadaptive processes in genome size evolution. Similarly, a
phylogenetic comparative analysis of 30 primrose species (Oenothera)
covering several transitions to asexuality showed no significant 
relationship between reproductive mode and genome size [160]. 

Because breeding systems can evolve quickly, more detailed 
analyses at a short phylogenetic scale are needed to get a clearer 
picture of their effects on genome size evolution. Moreover, breed- 
ing systems are often correlated with other life history traits, such as 
lifespan, which can make it hard to clarify the causes and conse- 
quences of the observed correlations. A detailed analysis of genome 
size in the Veronica genus suggests that selfing, not annuality, is 
associated with genome size reduction [161]. A comparison of 
14 pairs of plant congeneric species with contrasted mating systems 
also suggested a genome size reduction in selfers [162]. However, 
this could partly have been due to the four polyploid selfing species 
of the dataset—polyploidy can lead to haploid genome size reduc- 
tion because of the loss of redundant DNA following polyploidiza- 
tion. A better understanding can be gained from the comparative 
analysis of genome composition and organization, not only 
genome size. In Caenorhabditis nematodes, the observed reduction 
in genome size is not driven by reduction in TEs but by a global loss 
of all genomic compartments [163]. This pattern contradicts the 
hypothesis of relaxed selection in selfers against the accumulation of 
deleterious genomic elements. Alternatively, it could be explained 
by deletion bias and high genetic drift in selfers. However, in 
mutation accumulation lines, insertions predominate over deletions
in the selfing C. elegans, and deletions occurred at the whole gene 
level instead of being at random among genomic compartments, as 
predicted under a general deletion bias (see discussion in ref. 163). 
In this genus, Lynch’s hypothesis that evolution of genome size 
should be driven by changes in Ne does not apply. Alternatively, the
authors suggested that it is a more direct consequence of, or even an
adaptation to, the selfing lifestyle, although the underlying
mechanisms remain unclear.


3 A Genomic View of Breeding System Evolution 


Because breeding systems can strongly affect genome structure and
evolution, genomic approaches conversely offer powerful new
tools to reconstruct breeding system evolution and to test
evolutionary hypotheses, especially concerning long-term evolution.


3.1 Genomic Approaches to Infer Breeding System Evolution


3.1.1 Genomic Characterization of Breeding Systems

Genetic markers have long been used to determine breeding sys- 
tems and quantify selfing rates or degrees of asexuality. For 
instance, current selfing rates can be inferred using molecular
markers through FIS estimates or preferably (although more time
consuming) through progeny analyses [164-166]. Multilocus-based
estimates that take identity disequilibrium into account
greatly improve on the simple FIS-based method, which is sensitive to
several artifacts such as null alleles ([167], see also refs. 168, 169).
This method, implemented in the RMES software [167], has 
proven to give results very similar to progeny-based methods 
[170]. To take advantage of the information potentially available 
in sequence data, coalescence-based estimators have also been pro- 
posed to infer long-term selfing rates, and they have been imple- 
mented more recently in a Bayesian clustering approach in the 
INSTRUCT software package [171]. However, this approach 
mostly captures information from recent coalescence events so 
that such approaches still estimate recent selfing rates [28]. Much 
more information about long-term selfing rates can be derived 
from LD patterns [19], but this has not been fully exploited for 
selfing rate estimators (for instance, LD is not taken into account in 
INSTRUCT). Similarly, recombination can be inferred using 
genetic markers or sequence data, and more generally, various 
methods have been proposed to characterize the degree of clonality 
in natural populations (for review see ref. 172) and recently imple- 
mented in the R package RClone [173]. 
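
The simplest of these estimators can be sketched as follows. It assumes a population at inbreeding equilibrium under mixed mating, where the inbreeding coefficient satisfies F = σ/(2 - σ) and therefore σ = 2F/(1 + F). It is the naive single-statistic approach, not the multilocus method implemented in RMES or the clustering approach of INSTRUCT; artifacts such as null alleles would inflate F and hence the estimated selfing rate. Function names and heterozygosity values are hypothetical.

```python
def f_is(observed_het: float, expected_het: float) -> float:
    """Heterozygote deficiency F_IS = 1 - Ho / He."""
    if expected_het <= 0.0:
        raise ValueError("expected heterozygosity must be positive")
    return 1.0 - observed_het / expected_het


def selfing_rate_from_fis(fis: float) -> float:
    """Selfing rate implied by F_IS at inbreeding equilibrium.

    Assumes F = sigma / (2 - sigma), so sigma = 2F / (1 + F).
    Negative estimates of F (heterozygote excess) are truncated to 0,
    and values above 1 are truncated to 1.
    """
    f = min(max(fis, 0.0), 1.0)
    return 2.0 * f / (1.0 + f)


if __name__ == "__main__":
    # Hypothetical locus: observed heterozygosity 0.10, expected 0.40.
    f = f_is(0.10, 0.40)
    print(f"F_IS = {f:.2f}, implied selfing rate = {selfing_rate_from_fis(f):.3f}")
```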

Initially, such methods were applied with few markers, from 
which only global descriptions of breeding systems were deducible. 
Thanks to the considerable increase in sequencing facilities, it has 
become possible to finely characterize temporal and spatial varia- 
tions in breeding systems. In A. thaliana, an analysis of more than 
1000 individuals in 77 local stands using more than 400 SNP 
markers revealed spatial heterogeneity in outcrossing rates. Local 
“hotspots” of recent outcrossing (up to 15%) were identified, while 
other stands exhibited complete homozygosity with no detectable 
outcrossing [174]. Interestingly, at this local scale (from 30 m to 
40 km), outcrossing rates have been found to be twofold higher on 
average in rural than in urban stands; hence, selfing could be 
associated with higher disturbance in urban stands. 

Genomic data may also help characterize breeding systems in 
species with unknown or ill-characterized life cycles. In the yeasts
Saccharomyces cerevisiae and S. paradoxus, analyses of linkage
disequilibrium patterns made it possible to quantify the frequency of (rare)
sexual reproduction events and the proportion of inbreeding and 
outcrossing during these events [175, 176]. For instance, in the 
pico-algae Ostreococcus, no sexual form or process has been detected 
in the lab. However, the occurrence of infrequent recombination 
(about 1 meiosis for 10 mitoses) inferred from a population genomics
approach and the presence of meiosis genes in the genome
support the existence of a sexual life cycle [177]. Moreover, a
strong negative correlation between chromosome size and
GC-content has been observed [178]. In mammals and birds
(among others), such a pattern has been interpreted as a long-term
effect of gBGC acting on chromosomes with different average
recombination rates [116], small chromosomes having higher
recombination rates because of the constraint of at least one
chiasma per chromosome arm. A similar interpretation for Ostreococcus
is thus appealing. Genomic data also make it possible to test whether
the theoretical signatures of long-term asexuality are observed in
putative asexuals. As an example, whole-genome analyses of the
trypanosome T. b. gambiense demonstrated an independent evolution
and divergence of alleles on each homologous chromosome (the
“Meselson effect” [179, 180]), which is indicative of strict asexual
evolution [88]. In contrast, genomic studies of the putatively
ancient asexual bdelloids recently uncovered the occurrence of
inter-individual genetic exchanges ([181, 182]; see below,
Subheading 3.2.2).


3.1.2 Inferring and Dating Breeding System Transitions


Genomic approaches are also useful for analyzing the dynamics of
breeding system evolution. A simple way is to map breeding system 
evolution on phylogenies, which could provide a raw picture of the 
frequency and relative timing of breeding system transitions (e.g., 
[183]). However, these approaches, based on ancestral character 
reconstruction, are hampered by numerous uncertainties. For 
instance, in the case of two sister species with contrasting breeding 
systems, such as A. thaliana and A. lyrata, it is impossible to know 
whether A. thaliana evolved toward selfing just after divergence 
(about five million years ago) or only very recently. At a larger 
phylogenetic scale, inferring rates of transition between characters 
and ancestral states can be biased if diversification rates differ 
between characters—this is typically expected with breeding sys- 
tems for which asexuals and selfers should exhibit higher extinction 
rates than outcrossers [184]. 

Thanks to the genomic signatures left by contrasted breeding 
systems, it is possible to trace back transitions in the past and to date 
them more precisely. In diploid asexual species, because of the 
arrest of recombination, the two copies of each gene have diverged 
independently since the origin of asexuality. After having calibrated 
the molecular clock, it is thus possible to date this origin from the 
level of sequence divergence between the two copies. This so-called 
Meselson effect was observed and quantified in the trypanosome T. 
b. gambiense, suggesting that this species evolved asexually about 
10,000 years ago [88]. However, no Meselson effect has been 
observed in other presumably ancient asexual species such as oribatid
mites [185] or darwinulid ostracods [186], while data refute the
possibility of cryptic sex. In such cases, it is thus not possible to infer 
when recombination actually stopped, presumably because of
homogenizing processes such as very efficient DNA repair or automixis.
Mitotic recombination could also obscure the pattern predicted
under this Meselson effect. Of note, when asexuality
originates by hybridization (see above Subheading 2.4), the last 
common ancestor of the two copies of a gene dates back to the 
ancestor of the two parental lineages, which can be much older than 
the hybridization date, which invalidates the rationale described above.
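
The dating rationale described above can be sketched numerically. Under a strict molecular clock, two allele copies that stopped recombining t years ago are expected to differ by d = 2μt substitutions per site, so that t ≈ d/(2μ). The function name, the divergence value, and the substitution rate below are hypothetical placeholders and are not estimates taken from [88]; the sketch also assumes no gene conversion, no mitotic recombination, and a non-hybrid origin of asexuality.

```python
def asexuality_age(allelic_divergence, rate_per_site_per_year):
    """Time since recombination stopped, assuming a strict molecular clock.

    allelic_divergence: substitutions per site between the two allele
        copies of a gene within one asexual individual (hypothetical input).
    rate_per_site_per_year: calibrated substitution rate per lineage.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    if allelic_divergence < 0:
        raise ValueError("divergence cannot be negative")
    # Both copies accumulate changes independently, hence the factor 2.
    return allelic_divergence / (2.0 * rate_per_site_per_year)


# Illustrative values only (not taken from any study).
divergence = 0.002   # substitutions per site between the two copies
rate = 1e-9          # substitutions per site per year
print(f"Estimated age of asexuality: {asexuality_age(divergence, rate):,.0f} years")
```

If asexuality arose by hybridization, the same calculation would instead date the split of the two parental lineages, as noted above.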

Past transitions from outcrossing to selfing have also been 
investigated, through either population genomics approaches or 
the evolutionary analysis of self-incompatibility (SI) genes, which 
are directly involved in the transition to selfing. Since the evolution 
of selfing requires the breakdown of SI systems, initially constrained 
S-locus genes are expected to evolve neutrally after a shift to selfing. 
In A. thaliana, Bechsgaard et al. [187] reasoned that the dN/dS 
ratio in the selfing lineage should be the average of the neutral 
dN/dS (i.e., 1) and the outcrossing dN/dS (inferred from sister
lineages), weighted by the time spent in the selfing vs. the outcrossing
state. They deduced that SRK, one of the major SI genes,
became a pseudogene less than 400,000 years ago. SRK, however, 
is not the only gene involved in SI. Mutations in other genes may 
have previously disrupted the SI system, thus confusing SRK-based 
dating. Indeed, coalescence simulations showed that the observed 
genome-wide pattern of linkage disequilibrium is compatible with 
the transition to selfing one million years ago [188], suggesting a 
possible but debated two-step scenario in the evolution of selfing 
[189, 190]. The persistence of three distinct divergent SRK haplotypes
among extant A. thaliana individuals also suggests multiple
losses of SI [191], but the recent discovery of the co-occurrence of
the three haplotypes in Moroccan populations makes it possible that
selfing evolved in a single geographic region [192]. In another
Brassicaceae, i.e., Capsella rubella, analyses of both S-locus and 
genome-wide genes coupled with coalescence simulations sug- 
gested that selfing evolved very recently from the outcrosser 
C. grandiflora, around 50,000 years ago [193, 194] from a potentially
large number of founding individuals followed by a strong
reduction in Ne [195]. In the tetraploid selfer Arabidopsis suecica,
which originated as a hybrid between A. thaliana and the out- 
crossing A. arenosa, the genomic analysis of the S-locus also 
revealed the origin of selfing, suggesting an instantaneous loss of 
SI due to the fixation of nonfunctional alleles from both parents 
around 16,000 years ago [150]. 
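
The weighted-average reasoning of Bechsgaard et al. [187] described above can be made explicit. If T is the time since divergence from the outcrossing sister lineage and t_s the time spent selfing, the observed ratio is ω_obs = [t_s × 1 + (T − t_s) × ω_out]/T, which gives t_s = T(ω_obs − ω_out)/(1 − ω_out). The following is a minimal sketch of that algebra only; the function name and all numerical values are illustrative assumptions and are not the values reported in [187].

```python
def time_since_selfing(omega_observed, omega_outcrossing, divergence_time,
                       omega_neutral=1.0):
    """Solve the weighted-average relation for the time spent selfing.

    omega_observed: dN/dS measured along the selfing lineage.
    omega_outcrossing: dN/dS under functional constraint (sister lineages).
    divergence_time: total time covered by the selfing lineage branch.
    omega_neutral: dN/dS expected once the gene evolves neutrally.
    """
    if not omega_outcrossing < omega_neutral:
        raise ValueError("constrained dN/dS must be below the neutral value")
    if not omega_outcrossing <= omega_observed <= omega_neutral:
        raise ValueError("observed dN/dS must lie between the two regimes")
    fraction_selfing = (omega_observed - omega_outcrossing) / (
        omega_neutral - omega_outcrossing)
    return fraction_selfing * divergence_time


# Hypothetical inputs chosen for illustration.
t_s = time_since_selfing(omega_observed=0.26, omega_outcrossing=0.20,
                         divergence_time=5_000_000)
print(f"Time spent in the selfing state: about {t_s:,.0f} years")
```

As the surrounding text stresses, such an estimate dates the loss of constraint on the gene analyzed, which need not coincide with the loss of SI itself.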


3.2 Matching Breeding System Evolution Theories with Genomic Data


3.2.1 Testing the Dead-End Hypothesis: Comparison Between Selfing and Asexuality


The expected reduction in Ne in selfers and asexuals may increase
the drift load (accumulation of slightly deleterious mutations) and 
preclude adaptation. Selfing and clonality are thus supposed to be 
evolutionary dead ends [17, 18]. The twiggy phylogenetic distri- 
butions of asexuals [196] and selfers [183] or self-compatible 
species [197] suggest they are mostly derived recently from out- 
crossing ancestors (but see ref. 198). However, this observation may 
not be sufficient to support the dead-end hypothesis, and neutral 
models can also explain this pattern [199-201]. In a comprehensive 
and epochal phylogenetic study of several Solanaceae genera, Gold- 
berg et al. [202] went further by testing the irreversibility of 
transitions. Using a phylogenetic method developed for estimating 
the character effect on speciation and extinction [203, 204], they 
showed that self-compatible species have both higher speciation 
and extinction rates—with the resulting net diversification rates 
being lower—than self-incompatible species. This was the first 
direct demonstration of the dead-end hypothesis, and additional 
results have been obtained in Primula species [205]. In contrast,
in the Oenothera genus, asexuality has been found associated
with increased diversification but frequent reversion toward the 
sexual system, suggesting that the form of asexuality in this group 
is not an evolutionary dead end [206]. 

Genomic data also provide an opportunity to investigate the 
genetic causes of such long-term evolutionary failures. The 
increased dN/dS ratios reported in asexuals (see above) suggest 
that deleterious point mutations contribute to the load. However, 
in Daphnia rapid exposure of recessive deleterious alleles through 
mitotic recombination or gene conversion likely has a much stron- 
ger effect on clone persistence than their long-term accumulation 
under Muller’s ratchet [60]. TEs could also contribute to the load
and to the extinction of asexuals [135], though more data are still 
needed to unambiguously support this hypothesis (but see ref. 
136). The pattern in selfers is less clear. While theory globally 
predicts a reduction in selection efficacy in selfers, models also 
highlight conditions under which selection can be little affected 
or even enhanced in selfers [72, 73, 207], especially regarding TE 
accumulation [127, 137]. Empirical data on both protein and TE 
evolution have not revealed any strong evidence of long-term 
accumulation of deleterious mutations in selfers, as compared to
outcrossers, whereas polymorphism data mainly support relaxation 
of selection in selfers (Table 2). This is in agreement with the recent
origin of selfing but makes it difficult to infer the underlying
causes of higher extinction in selfers, as trait-dependent diversification
processes alter the relationship between life history traits
and rate of molecular evolution [208]. A reduced ability to respond
to environmental changes through adaptive evolution could also 
contribute to long-term extinction in asexuals (but see ref. 209) and 
selfers, especially if standing variation is needed to rescue
populations experiencing environmental challenges [77, 210]. Few
studies, however, have compared the rate of adaptation in selfers 
and outcrossers (see Table 2). Theoretical predictions regarding this 
effect, moreover, critically depend on the dominance levels of new
favorable mutations [72, 73, 77, 210], which are poorly known 
(but see ref. 80). 

While several issues remain open, current knowledge suggests 
that selfers are less prone to extinction than asexuals. The wider 
distribution of selfing than clonality in plants supports this view 
[211, 212]. Selfers could go toward extinction more slowly than 
asexuals, and the causes of their extinction could differ. Since 
deleterious mutations should accumulate at a slower rate in selfers 
than in asexuals, as suggested by theory and current data, this 
process would likely not be sufficient to drive them to extinction. 
The reduced adaptive potential could be the very cause of their 
ultimate extinction as initially proposed by Stebbins [18], which 
could generally occur before sufficient deleterious mutations have 
accumulated to be detected via molecular measures of divergence. 
In contrast, in asexuals, the accumulation of deleterious mutations
could be fast enough to leave a molecular signature and
contribute to extinction. Alternatively, demographic characteristics 
associated with uniparental reproduction, such as recurrent bottle- 
necks, fragmented populations, and extinction/recolonization 
dynamics, could be sufficient to drive population extinction simply
because of higher sensitivity to demographic stochasticity (see also 
ref. 213). Genomic degradation would then only be a signature of the
evolution toward selfing and clonality without being the ultimate
cause of their extinctions. These hypotheses need to be further
investigated by building more realistic demo-genetic models and
by better integrating genomic and ecological approaches. 

The literature reviewed above focuses on intrinsic factors that 
may affect the extinction rate of selfing and asexual species, taken as 
established lineages, compared to their sexual relatives. Alterna- 
tively, Janko et al. [199] suggested that if asexual mutants are 
produced at a relatively high rate and compete with each other, 
this would imply a rapid turnover between clonal lineages and a 
young expected age for extant asexuals, without the need to invoke 
any fitness effect (see also refs. 200, 201). Of note, this model 
invokes competitive exclusion among clonal lineages, but not 
between clonal and sexual ones—the ancestral sexual gene pool is 
assumed to be immune from extinction. 


The few putatively ancient asexuals known so far seem to escape the 
mutational load predicted by the dead-end hypothesis and avoid 
extinction over long evolutionary time scales. For example, fossil 
evidence and decades of microscopic observations indicate that 
bdelloid rotifers have apparently persisted for over 40 million 
years without meiosis, males, or conventional sexual reproduction
[15, 214]. As a matter of fact, the first genome assembly published
for these organisms confirmed that their genome structure is
incompatible with conventional meiosis [215]. However, two independent
studies recently demonstrated that bdelloids could experience
genetic exchanges between individuals.

A first article by Debortoli et al. [182] provided evidence of frequent
horizontal exchanges of genetic fragments between individuals of 
the species Adineta vaga (Adinetidae). Such horizontal transfers 
could be promoted by the peculiar ecology of these rotifers, which 
experience frequent desiccations damaging their cell and nucleus 
membranes and thus allowing for the entry of foreign DNA in the 
cells. In addition, desiccation induces multiple DNA double-strand 
breaks, facilitating the integration of foreign DNA during repair 
processes. 

Another study by Signorovitch et al. [181] identified a pattern 
of allele sharing between individuals of the species Macrotrachela 
quadricornifera (Philodinidae) that was incompatible with strict 
asexual evolution. The authors suggested that bdelloids had 
evolved an atypical meiotic mechanism similar to what has been 
described in some species of evening primroses (Oenothera), in which chromosomes
organize into a ring during meiosis without requiring
homologous chromosome pairing [216]. They advocated that even 
rare events of such unconventional sex could be enough to generate 
the observed pattern of allele sharing. 

In the absence of conventional meiosis and syngamy, bdelloid 
rotifers might thus have escaped extinction by maintaining some 
level of genetic exchanges between individuals, either through 
horizontal gene transfers or unconventional Oenothera-like meio- 
sis. Regardless of the underlying molecular mechanisms, bdelloids 
should not be considered as “ancient asexual scandals” anymore. 
These recent results call for a reassessment of the reproductive 
mode of all supposedly ancient asexuals (see Subheading 3.1.1 
above). The rise of genomic studies in recent years will greatly 
help to decipher whether putative asexuals evolve as strict
asexuals or have developed new alternatives to sex. 


4 Conclusion and Prospects 


There is a large body of theory on the effects of breeding systems on 
molecular evolution. However, some of its predictions have not been clearly
verified by empirical data, and numerous questions remain. Genomic
data have also partly unveiled the complexity of breeding
systems, especially in asexual or presumably asexual species. 
Promising prospects include (1) analysis of the rate and pattern of
transition to selfing/asexuality using densely sampled phylogenies
with appropriate breeding system distributions combined with
genome-wide molecular data, (2) distinguishing between the different
forms of selection with a better characterization of the fitness
effect of mutations, and (3) explicitly accounting for the possible association
between breeding system shifts and non-equilibrium demographic
dynamics (e.g., bottlenecks in selfers, clone turnover in
asexuals). A large theoretical corpus has already been developed,
and thanks to the increasing availability of genomic data, qualitative 
patterns are now rather well described and partly understood. 
A further challenge is to make our predictions
and tests more quantitative.


1. What population genetic parameters are affected, and how, by 
selfing and asexuality? 


2. What are the potential problems when comparing the dN/dS 
ratio between selfers and outcrossers or sexuals and asexuals? 


3. What is the evolutionary “dead-end hypothesis,” and how can 
we test it using phylogenetic and evolutionary genomic tools?


Carolin Kosiol and Maria Anisimova


Abstract 


Populations evolve as mutations arise in individual organisms and, through hereditary transmission, may 
become “fixed” (shared by all individuals) in the population. Most mutations are lethal or have negative 
fitness consequences for the organism. Others have essentially no effect on organismal fitness and can 
become fixed through the neutral stochastic process known as random drift. However, mutations may also 
produce a selective advantage that boosts their chances of reaching fixation. Regions of genomes where new 
mutations are beneficial, rather than neutral or deleterious, tend to evolve more rapidly due to positive 
selection. Genes involved in immunity and defense are a well-known example; rapid evolution in these 
genes presumably occurs because new mutations help organisms to prevail in evolutionary “arms races” 
with pathogens. In recent years genome-wide scans for selection have enlarged our understanding of the 
genome evolution of various species. In this chapter, we will focus on methods to detect selection on the 
genome. In particular, we will discuss probabilistic models and how they have changed with the advent of 
new genome-wide data now available. 


Key words Conserved and accelerated regions, Positive selection scans, Codon models, Selection- 
mutation models, Polymorphism-aware phylogenetic models 


1 Introduction 


In the past, selection studies mainly focused on the analysis of
particular loci such as genes, proteins, or regulatory elements of interest.
With the availability of comparative genomic data, the emphasis
has shifted from the study of individual proteins to genome-wide 
scans for selection. 

The search for selection can be performed on different levels 
comparing homologous nucleotide sequences or protein-coding 
genes in one or multiple genomes. The evolutionary processes at
all these levels can be described by probabilistic models, which set
the basis for evaluating selective pressures and selection tests. This
chapter gives an introduction to the fundamental properties
of the probabilistic models used to detect selection (Subheading 3),
as well as examples of genome-wide scans.


Fig. 1 A diagram illustrating the different data and levels to analyze genomic
sequences and the relationship of the various approaches modeling selection
(axis labels recovered from the figure: "sampled genotypes" and "time")


In Fig. 1, we summarize the different data levels and time scales 
of modeling selection on genomes. 


2 Comparative Genome Data 


Several whole genome sequence data sets are now available for 
selection scans. Mammalian genomes are well represented 
(in particular primates), and insect genomes are becoming more 
numerous (in particular Drosophila). These data can be down- 
loaded as orthologous alignments from the Ensembl [1] and 
UCSC [2] browsers. 

In light of recent advances in DNA sequencing, with so-called 
next generation sequencing (NGS) technologies that have dramat- 
ically reduced the cost and time needed to sequence an organism’s 
entire genome, large-scale (involving many organisms) sequencing 
projects have been and are currently being undertaken. Just to 
name a few, genome projects re-sequencing 1000 D. melanogaster 
[3] and 1001 Arabidopsis [4] were accomplished, and the 100,000 
human genome project [5] is ongoing. These polymorphism data 
from multiple individuals from several species enable us to detect 
very recent selection. 

Together with the progress in sequencing technologies, algo- 
rithmic advances now allow the de novo assembly of genomes from 
NGS data, including complex mammalian genomes (e.g., giant 
panda genome [6]). Therefore, not only international consortia
but also small groups and individual labs can now envisage
sequencing the organisms of their interest. As a consequence, platforms
for sharing these data have been established. For example, the
Genome 10K project aims to assemble a genomic zoo, a collection
of DNA sequences representing the genomes of 10,000 vertebrate
species, approximately one for every vertebrate genus. All these
genomes can be subject to scans for selection, for which we outline
methods below.


3 Methods


3.1 Probabilistic Models for Genome Evolution


The statistical modeling of the evolutionary process is of great 
importance when performing selection studies. When comparing 
reasonably divergent sequences, counting the raw sequence differences
(percentage of sites with observed changes) underestimates the
amount of evolution that has occurred because, by chance alone, 
some sites will have incurred multiple substitutions. In this chapter 
we discuss maximum likelihood (ML) and Bayesian methods to 
detect selection based on probabilistic models of character evolu- 
tion. Such substitution models provide more accurate evolutionary 
distance estimates by accounting for these unobserved changes and 
often explicitly model the selection pressures. 
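
To illustrate why raw counts underestimate divergence, the sketch below compares the observed proportion of differing sites (the p-distance) with the Jukes-Cantor (JC69) corrected distance, d = −(3/4) ln(1 − 4p/3). JC69 is the simplest nucleotide substitution model and is used here only as an illustration; it is not one of the models developed in this chapter, and the two toy sequences are invented.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites that differ between two sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def jc69_distance(p):
    """Expected substitutions per site under JC69, given p-distance p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 correction is defined only for 0 <= p < 0.75")
    # The logarithm accounts for multiple substitutions at the same site.
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Invented toy alignment (three differing sites out of twenty).
seq1 = "ACGTACGTACGTACGTACGT"
seq2 = "ACGTTCGTACGAACGTACCT"
p = p_distance(seq1, seq2)
d = jc69_distance(p)
print(f"p-distance = {p:.3f}, JC69 distance = {d:.3f}")
```

The corrected distance is always at least as large as the p-distance, and the gap widens as sequences become more divergent.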

One of the primary assumptions made in defining probabilistic
substitution models is that the future evolution of a site depends only on its
current state and not on previous (ancestral) states. Statistical processes
with this lack of memory are called Markov processes. The
assumption itself is reasonable, because during evolution mutation
and natural selection can only act upon the molecules present
in an organism and have no knowledge of what came previously.
However, some large-scale mutational events, such as recombina- 
tion [7], gene conversion (e.g., see [8, 9]), or horizontal transfer 
[10] might not satisfy this “memoryless” condition. 

To reduce the complexity of evolutionary models, it is often 
further assumed that each site in a sequence evolves independently 
from all other sites. However, there is evidence that the independence-of-sites
assumption is violated. In real proteins, chemical interactions
between neighboring sites or the protein structure affect how
other sites in the sequence change. Steps have been made toward
context-dependent models, where the specific characters at neighboring
sites affect the site's evolution (e.g., see [11, 12]).

The Markov model asserts that one sequence is derived from 
another by a series of independent substitutions, each changing one 
character in the first sequence to another character in the second 
during evolution. Thereby we assume independence of evolution
at different sites. A continuous-time Markov process is fully
defined by its instantaneous rate matrix $Q = \{q_{ij}\}_{i,j = 1, \ldots, N}$.

The diagonal elements of Q are defined by a mathematical 
requirement that the rows sum up to zero. For multiple sequence 
alignments, the substitution process runs in continuous time over a 
tree representing phylogenetic relations between the sequences. 
The transition probability matrix $P(t) = \{p_{ij}(t)\} = e^{Qt}$ consists of
transition probabilities from residue $i$ to residue $j$ over time $t$ and is
found as a solution of the differential equation $dP(t)/dt = P(t)Q$
with $P(0)$ being the identity matrix. In order for tree branches to be
measured by the expected number of substitutions per site, the Q-matrix
is scaled so that the average substitution rate at equilibrium
equals 1.

As a matter of mathematical and computational convenience
rather than biological reality, several simplifying assumptions are
usually made. Standard substitution models allow any state to
change into any other. Such a Markov process is called irreducible
and has a unique stationary distribution corresponding to the
equilibrium codon frequencies $\pi = \{\pi_i\}$. Time reversibility implies
that the direction of the change between two states $i$ and $j$ is
indistinguishable, so that $\pi_i p_{ij}(t) = \pi_j p_{ji}(t)$. This assumption helps
to reduce the number of model parameters and is convenient when
calculating the matrix exponential (the Q-matrix of a reversible process
has only real eigenvectors and eigenvalues [13]). A fully unrestrained
Q-matrix for $N$ characters defines an irreversible model with
$N(N - 1) - 1$ free parameters, while for a reversible process this
number is $N(N + 1)/2 - 2$.

By comparing how well substitution models explain sequence
evolution, and by examining the parameters estimated from data,
ML and Bayesian inference can be used to address many biologically
important questions. In this section we focus on probabilistic
models that are used to detect selection.


3.2 Detecting Regions of Accelerated Genome Evolution


Understanding the forces shaping the evolution of specific lineages 
is one of the most exciting areas in evolutionary genomics. In 
particular, regions of accelerated evolution in mammalian and 
insect species have been studied (e.g., see [14]). To eliminate non- 
functional regions, one strategy is to begin with a search for regions 
that are conserved through the mammalian history or longer. A 
likelihood ratio test (LRT) may be used to detect acceleration of 
rates in a lineage of interest, for example, the human lineage. Such
an LRT compares the likelihood of the alignment data under two
probabilistic models. The null model has a single scale parameter
representing shortening (more conserved) and lengthening (less
conserved) of all branches of the tree. The alternative model has an
additional parameter for the human lineage, which is constrained to
be >1. This extra parameter allows the human branch to be relatively
longer (accelerated) than the branches in the rest of the tree.
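
The test described above can be sketched as follows, assuming that the maximized log-likelihoods of the two models have already been obtained from a phylogenetic software package. The function names and the log-likelihood values are hypothetical. Because the extra parameter is constrained to one side of its null value, the sketch assumes (this is not stated in the text) that the null distribution is a 50:50 mixture of a point mass at zero and a χ² distribution with one degree of freedom, which halves the usual P-value.

```python
import math


def chi2_sf_1df(x):
    """P(X > x) for a chi-square variable with one degree of freedom."""
    if x <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))


def acceleration_lrt(lnl_null, lnl_alt, boundary=True):
    """Likelihood ratio test for lineage-specific acceleration.

    lnl_null: log-likelihood with one scale parameter for the whole tree.
    lnl_alt: log-likelihood with an extra scale for the lineage of interest.
    boundary: if True, use the 50:50 mixture null (assumed here because
        the extra parameter cannot fall below its null value).
    """
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    p_value = chi2_sf_1df(statistic)
    if boundary:
        p_value = 0.5 * p_value if statistic > 0.0 else 1.0
    return statistic, p_value


# Hypothetical log-likelihoods for one conserved element.
stat, p = acceleration_lrt(lnl_null=-1234.5, lnl_alt=-1230.0)
print(f"2*delta lnL = {stat:.2f}, P = {p:.5f}")
```

In a genome-wide scan this calculation would be repeated for every candidate region, so the resulting P-values require correction for multiple testing.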

For example, this approach was used to identify genomic 
regions that are conserved in most vertebrates but have evolved 
rapidly in humans. Interestingly, the majority of the human accel- 
erated regions (HARs) were noncoding, and many were located 
near protein-coding genes with protein functions related to the 
nervous system [14]. 

In contrast, the majority of Drosophila melanogaster accelerated
regions (DMARs) are found in protein-coding regions and
primarily result from rapid adaptive change at synonymous sites
[15]. This could be because flies have much more compact genomes
compared to humans; however, even after considering the
genomic content, in Drosophila a significant excess of DMARs
occurs in protein-coding regions. Furthermore, Holloway and colleagues
observed a mutational bias from G|C to A|T, and therefore
the accelerated divergence in DMARs might be attributed to a shift
in codon usage and a fixation of many suboptimal codons.

In a similar manner, amino acid-based models search for site- or
lineage-specific rate accelerations and residues subject to altered
functional constraints. Such sites are likely to be contributing to
the change in protein function over time. The advantage of amino
acid-based models is that they might be suitable for the analysis of
deep divergences of fast-evolving genes, where sequences rapidly
saturate over time. Also, amino acid methods are not influenced by
the effects of codon bias, a topic that is discussed at the end of this
chapter. The idea is that adaptive change on the level of amino acid
sequences may not necessarily correspond to an adaptive change in
protein function but rather to peaks in the protein adaptive landscape
reflecting the optimization of the protein function in a particular
species to long-term environmental changes. One class of
methods for detecting functional divergence searches for a lineage-specific
change in the shape parameter of the gamma distribution
that is used to model rate heterogeneity (see [16-19]). Other
methods search for evidence of clade-specific rate shifts at individual
sites (see [20-26]). For example, Gu [21] proposed a simple stochastic
model for estimating the degree of divergence between two
pre-specified clusters. The statistical significance was tested using
site-specific profiles based on a hidden Markov model, which was
used to identify amino acids responsible for these functional differences
between two gene clusters. More flexible evolutionary models
were incorporated in the maximum likelihood approach
applicable to the simultaneous analysis of several gene clusters
[27]. This was extended [28] to evaluate site-specific shifts in
amino acid properties, in comparison with site-specific rate shifts.
Pupko and Galtier [24] used the LRT to compare ML estimates of
the replacement rate at an amino acid site in distinct subtrees.


3.3 Codon Models: Site, Branch, and Branch-Site Specificity


3.3.1 Basic Codon Models


In protein-coding sequences, nucleotide sites at different codon 
positions usually evolve with highly heterogeneous patterns (e.g., 
[29]). Thus DNA substitution models fail to account for this 
heterogeneity unless the sequences are partitioned by codon posi- 
tions for the analysis. But even then, DNA models do not model 
the structure of genetic code or selection at the protein level. 
Indeed, one advantage of studying protein-coding sequences at 
the codon level is the ability to distinguish between nonsynon- 
ymous (AA replacing) and synonymous (silent) codon changes. 
Based on this distinction, the selective pressure on the protein-coding
level can be measured by the ratio $\omega = d_N/d_S$ of the
nonsynonymous to synonymous substitution rates. The nonsynonymous
substitution rate may be higher than the synonymous rate,
and thus $\omega > 1$, due to fitness advantages associated with recurrent
AA changes in the protein, i.e., positive selection on the protein. In
contrast, purifying selection acts to preserve the protein sequence,
so that the nonsynonymous substitution rate is lower than the
synonymous rate, with $\omega < 1$. Neutrally evolving sequences exhibit
similar nonsynonymous and synonymous rates, with $\omega \approx 1$.

The first methods that used the $\omega$ ratio as a criterion to detect
positive selection were based on pairwise estimation of $d_N$ and $d_S$
rates with “counting” methods (e.g., see [30]). However, ML
estimates of pairwise $d_N$ and $d_S$ based on a codon model were
shown to outperform all other approaches [31]. Moreover, a Markov
codon model is naturally extended to multiple sequence alignments,
unlike the counting methods. This, together with the
benefits of the probabilistic framework within which codon models 
are defined, made codon models very popular in studies of positive 
selection in protein-coding genes. 

The first two codon models were proposed simultaneously in
the same issue of Molecular Biology and Evolution [32, 33]. The
model of Goldman and Yang [32] included the transition/transversion
rate ratio $\kappa$ and modeled the selective effect indirectly using
a multiplicative factor based on Grantham [34] distances, but was
later simplified to estimate the selective pressure explicitly using the
$\omega$ parameter [35]. The main distinction between the first codon
models concerns the way to describe the instantaneous rates with
respect to equilibrium frequencies: (1) proportional to the equilibrium
frequency of a target codon (as in Goldman and Yang [32]) or
(2) proportional to the frequency of a target nucleotide (as in Muse
and Gaut [33]).
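
As an illustration of option (1), the sketch below builds a codon rate matrix in the commonly used simplified form with a single $\omega$: the rate from codon $i$ to codon $j$ is zero if the codons differ at more than one position and otherwise equals $\pi_j$, multiplied by $\kappa$ for a transition and by $\omega$ for a nonsynonymous change. The function name, the parameter values, and the choice of equal codon frequencies are assumptions made for this example, which uses the standard genetic code and excludes stop codons.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [codon for codon in CODE if CODE[codon] != "*"]  # 61 sense codons
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def build_codon_q(kappa, omega, freqs=None):
    """Return (Q, pi) for a single-omega codon model.

    Q is scaled so that the mean substitution rate at equilibrium is 1.
    Equal codon frequencies are assumed unless freqs is supplied.
    """
    n = len(SENSE)
    pi = list(freqs) if freqs is not None else [1.0 / n] * n
    q = [[0.0] * n for _ in range(n)]
    for i, ci in enumerate(SENSE):
        for j, cj in enumerate(SENSE):
            diffs = [(a, b) for a, b in zip(ci, cj) if a != b]
            if len(diffs) != 1:
                continue  # same codon or multi-nucleotide change: rate 0
            rate = pi[j]  # proportional to target codon frequency
            if frozenset(diffs[0]) in TRANSITIONS:
                rate *= kappa
            if CODE[ci] != CODE[cj]:
                rate *= omega
            q[i][j] = rate
        q[i][i] = -sum(q[i])  # diagonal makes each row sum to zero
    mean_rate = -sum(pi[i] * q[i][i] for i in range(n))
    q = [[value / mean_rate for value in row] for row in q]
    return q, pi


q, pi = build_codon_q(kappa=2.0, omega=0.5)
max_row_sum = max(abs(sum(row)) for row in q)
print(f"{len(q)} sense codons, largest |row sum| = {max_row_sum:.1e}")
```

Because each off-diagonal rate is a symmetric factor multiplied by the target frequency, the resulting process is time reversible; the Muse and Gaut formulation would replace the target codon frequency with the frequency of the target nucleotide.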

In 2006, empirical codon models were estimated (see
[36, 37]) that summarize substitution patterns from large quantities
of protein-coding gene families. In contrast to the parametric
codon models that estimate gene-specific parameters (e.g.,
transition/transversion rate ratio $\kappa$, selective pressure $\omega$, etc.), the empirical
codon models do not explicitly consider distinct factors that shape 
protein evolution. Standard parametric models assume that protein 
evolution proceeds only by successive single-nucleotide substitu- 
tions. However, empirical codon models indicate that model accu- 
racy is significantly improved by incorporating instantaneous 
doublet and triplet changes. Kosiol et al. [37] also found that the
affiliations between a codon, the amino acid it encodes, and the
physicochemical properties of the amino acid are the main driving
factors of the process of codon evolution. Neither multiple-nucleotide
changes nor the strong influence of the genetic code nor amino
acid properties form a part of the standard parametric models.


On the other hand, parametric models have been very successful
in applications studying biological forces shaping protein evolution
of individual genes. Thus combining the advantages of
parametric and empirical approaches offers a promising direction.
Kosiol, Holmes, and Goldman [37] explored a number of combined
codon models that incorporated empirical AA exchangeabilities
from the empirical codon model (ECM) while using parameters to study selective pressure,
transition/transversion biases, and codon frequencies. Similarly,
AA exchangeabilities from (suitable) empirical AA matrices may
be used to alter probabilities of nonsynonymous changes, together
with traditional parameters $\omega$, $\kappa$, and codon frequencies $\pi_i$ [38]. In
2013, De Maio et al. [39] extended the ECM approach to accommodate
site-specific variation of selective pressure and lineage-specific
variation. Simulations showed that ECMs allowing for
double and triple mutations are more conservative: they reduce the
number of false positives and have less power to detect positive
selection [39].


3.3.2 Accounting for Variability of Selective Pressures


The first codon models assumed constant nonsynonymous and synonymous
rates among sites and over time. Although most proteins
evolve under purifying selection most of the time, positive selection
may drive the evolution in some lineages. During episodes of
adaptive evolution, only a small fraction of sites in the protein
have the capacity to increase the fitness of the protein via AA
replacements. Thus approaches assuming constant selective pressure
over time and over sites lack power in detecting genes affected
by positive selection. Consequently, various scenarios of variation in
selective pressure were incorporated in codon models, making
them more powerful at detecting positive selection, and short
episodes of adaptive evolution in particular. Evidence of positive
selection on a gene can be obtained by an LRT comparing two
nested models: a model that does not allow positive selection
(constraining $\omega \le 1$ to represent the null hypothesis) and a model
that allows positive selection ($\omega > 1$ is allowed in the alternative
hypothesis). Positive selection is detected if a model allowing $\omega > 1$ fits the data
significantly better than the model restricting $\omega \le 1$ at all
sites and lineages. However, the asymptotic null distribution may
vary from the standard $\chi^2$ due to boundary problems or if some
parameters become not estimable (e.g., see [40, 41]).


3.3.3 Case Study: Application of a Genome-Wide Scan of Positive Selection on Six Mammalian Genomes


In 2006, six high-coverage genome assemblies became available for 
eutherian mammals. The increased phylogenetic depth of this data 
set permitted Kosiol and colleagues [42] to perform several new 
lineage- and clade-specific tests using branch-site codon models. Of 
~16,500 human genes with high-confidence orthologs in at least 
two other species, 544 genes showed significant evidence of positive
selection using branch-site codon models and standard LRTs.

Interestingly, several pathways were found to be strongly enriched in
genes with positive selection, suggesting possible coevolution of
interacting genes. A striking example is the complement immunity system,
a biochemical cascade responsible for the elimination of pathogens. This
system consists of several small proteins found in the blood that
cooperate to kill target cells by disrupting their plasma membranes. Of
78 genes associated with this pathway in KEGG (see
http://www.genome.jp/kegg-bin/show_pathway?map04610 for the complement
cascades), nine were under positive selection (FDR < 0.05), and five
others had nominal P < 0.05. Most of the genes under positive selection
are inhibitors (DAF, CFH, CFI) and receptors (C5AR1, CR2), but some are
part of the membrane attack complex (C7, C9, C8B), which punctures cell
membranes to initiate cell lysis. Here we focus on the analysis of these
proteins of the membrane attack complex.

First we calculate the gene-averaged ω value using the basic M0 model
[32]. The ML estimates of ω < 1 (ω = 0.31 for C7, ω = 0.25 for C8B, and
ω = 0.44 for C9) indicate that most sites in these genes are under
purifying selection. However, selection pressure could be variable at
different locations of the membrane proteins, and we therefore continue
our analysis by applying models that allow for variation in selective
pressure across sites.


3.3.4 Selective Variability Among Codons: Site Models

The simplest site models use the general discrete distribution with a
pre-specified number of site classes. Each site class i has an
independent parameter ωᵢ estimated by ML together with the proportions
of sites pᵢ in each class. Since a large number of site categories
requires many parameters, three categories are usually used (requiring
five independent parameters). To test for positive selection, several
pairs of nested site models were defined to represent the null and
alternative hypotheses in LRTs. For example, model M1a includes two site
classes, one with ω₀ < 1 and another with ω₁ = 1, representing the
neutral model of evolution (the null hypothesis). The alternative model
M2a extends M1a by adding an extra site class with ω₂ > 1 to accommodate
sites evolving under positive selection. Significance of the LRT is
tested using the χ² distribution with 2 degrees of freedom for the M1a
vs. M2a comparison. We test the C7 gene for positive selection by the
LRT comparing nested models M1a and M2a (Table 1).

Model M2a has two additional parameters compared to model M1a. The
resulting LRT statistic is 2(log L2 − log L1) = 2 × (−6369.67 −
(−6377.35)) = 2 × 7.68 = 15.36. This is much greater than the critical
value of the chi-square distribution χ² (df = 2, at 5%) = 5.99, and we
calculate a p-value of P = 5.0e−04. However, the M1a vs. M2a comparison
for genes C8B and C9 is not significant.
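
The arithmetic of this test can be reproduced with a few lines of Python. This is only a minimal sketch: the two log-likelihoods are the rounded values reported for C7, the function names are illustrative, and the closed form exp(−x/2) for the upper tail of a χ² distribution with 2 degrees of freedom is used in place of a statistics library. With the rounded inputs the p-value comes out at about 4.6e−04, the same order as the value quoted in the text.

```python
import math


def lrt_statistic(lnl_null, lnl_alt):
    """Twice the log-likelihood gain of the alternative over the null model."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


lnl_m1a = -6377.35  # null model M1a (Table 1)
lnl_m2a = -6369.67  # alternative model M2a (Table 1)

stat = lrt_statistic(lnl_m1a, lnl_m2a)
p_value = chi2_sf_df2(stat)

print(f"2*(lnL2 - lnL1) = {stat:.2f}")
print(f"exceeds 5% critical value 5.99: {stat > 5.99}")
print(f"p-value = {p_value:.1e}")
```

Note the order of the subtraction: the alternative model always has the higher (less negative) log-likelihood, so the statistic is non-negative.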


Table 1
Parameter estimates and log-likelihoods for a LRT of positive selection for the complement immunity component C7

Model            Site class  Proportion                 ω ratio      Log-likelihood
M1a (neutral)    0           p₀ = 0.69                  ω₀ = 0.07    L1 = −6377.35
                 1           p₁ = 1 − p₀ = 0.31         ω₁ = 1
M2a (selection)  0           p₀ = 0.70                  ω₀ = 0.08    L2 = −6369.67
                 1           p₁ = 0.29                  ω₁ = 1
                 2           p₂ = 1 − p₀ − p₁ = 0.01    ω₂ = 10.89

The model M2a is the alternative model with a class of sites with ω > 1. The null hypothesis M1a is the same model but with ω₂ = 1 fixed


Another LRT can be performed on the basis of the modified model M8 with
two site classes: one with sites where the ω ratio is drawn from the
beta distribution (with 0 < ω < 1 describing the neutral scenario) and
the second, discrete class, with ω > 1. Constraining ω = 1 for this
second class provides a sufficiently flexible null hypothesis, whereby
all evolution can be explained by sites with ω from the beta
distribution or from a single site class with ω = 1. Significance of the
LRT is tested using the mixture ½χ²₀ + ½χ²₁ for the M8 (ω = 1) vs. M8
comparison. If the LRT for positive selection is found to be
significant, specific sites under positive selection may be predicted
based on the posterior probabilities (PP) of belonging to the site class
under positive selection (usually PP > 0.95, but see [43, 44]). Such
posterior probabilities are estimated using the naive empirical Bayesian
approach (NEB, [45]), the full hierarchical Bayesian approach [46], or a
mid-way approach, the Bayes empirical Bayes (BEB, [44]). For a
discussion of these approaches, see Scheffler and Seoighe [47] and
Aris-Brosou [48]. Alternatively, Massingham and Goldman [49] proposed a
site-wise likelihood ratio estimation to detect sites under purifying or
positive selection.
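
The posterior probability calculation behind the NEB approach is an application of Bayes' rule, which the following sketch illustrates for a single codon site. The class proportions are the M2a estimates from Table 1; the per-class site likelihoods are hypothetical numbers chosen for illustration (in practice they come from the pruning algorithm applied to the alignment column under each ω class), and the function name is an assumption of this sketch.

```python
def neb_posteriors(proportions, site_likelihoods):
    """Posterior probability of each site class for one codon site.

    PP(class k | data) = p_k * L_k / sum_j (p_j * L_j),
    with the ML estimates plugged in as if they were known (naive empirical Bayes).
    """
    joint = [p * lk for p, lk in zip(proportions, site_likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


# Class proportions from the M2a fit in Table 1 (classes 0, 1, 2).
proportions = [0.70, 0.29, 0.01]

# HYPOTHETICAL likelihoods of one alignment column under each class's omega.
site_likelihoods = [1.0e-9, 4.0e-8, 6.0e-5]

pp = neb_posteriors(proportions, site_likelihoods)
for k, value in enumerate(pp):
    print(f"class {k}: PP = {value:.4f}")
print("flagged as positively selected:", pp[2] > 0.95)
```

Even a class with a small prior proportion can receive a high posterior probability when the data at a site are much better explained by ω > 1. BEB differs from this sketch by also integrating over uncertainty in the parameter estimates.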

For the C7 gene, using BEB we identified several amino acid sites to be
putatively under selection: residue R at position 223 (PP = 0.94), H at
position 239 (PP = 0.93), and N at position 331 (PP = 0.93).
Unfortunately, the crystal structures of C7 (as well as C8B and C9) are
not known, and we cannot relate the location of amino acids in the
protein sequence to relevant 3D data, such as sites of protein-protein
interaction or binding sites of the protein. If such structural
information were known, it would also be possible to use this biological
knowledge in a model that is aware of the position of the different
structural elements.

Site models that do not use a priori partitioning of codons (such as
those described above) are known as random-effect (RE) models. In
contrast, fixed-effect (FE) models categorize sites based on prior
knowledge, e.g., according to tertiary structure for single genes, or by
gene category for multigene data. Site partitions for FE models can also
be defined based on inferred recombination breakpoints, which is useful
for inferring positive selection from recombining sequences (see [50,
51]), although the uncertainty of breakpoint inference is ignored in
this way. FE models with each site being a partition should be avoided,
as they lead to the “infinitely many parameter trap” (e.g., see [52]).
Given a biologically meaningful a priori partitioning, FE models are
useful to study heterogeneity among partitions. However, a priori
information is not always available.


3.3.5 Selective Variability over Time: Branch Models

A simple way to include the variation of the selective pressure over
time is by using a separate parameter ω for each branch of a phylogeny
(known as the free-ratio model; [35]). Compared with the one-ratio model
(which assumes constant selection over time), the free-ratio model
requires an additional 2T − 4 ω parameters for T species. Figure 2 shows
the estimates of the free-ratio model for the C8B gene. Although the ML
estimates of ω values on the rodent lineages are visibly higher than on
the primate lineages, none of the branches has ω > 1.

Other branch models can be defined by constraining different sets of
branches of a tree to have an individual ω. LRTs are used to decide
(1) whether selective pressure is significantly different on a
pre-specified set of branches and (2) whether these branches are under
positive selection.

However, branch models have relatively poor power to detect selection
[53] in comparison to branch-site models that are discussed in the next
section. Also note that testing of multiple hypotheses on the same data
requires a correction, so the overall false-positive rate is kept at the
required level (most often 5%). Correction for multiple testing further
reduces the power of the method, especially when many hypotheses are
tested simultaneously (see Subheading 4 later).


3.3.6 Temporal and Spatial Variation of Selective Pressure

Several solutions were proposed to simultaneously account for
differences in selective constraints among codons and the episodic
nature of molecular evolution at individual sites. One of the first
models, model MA [45], assumes four site classes. Two classes contain
sites evolving constantly over time: one under purifying selection with
ω₀ < 1; another with ω₁ = 1. The other two site classes allow selective
pressure at a site to change over time on a pre-specified set of
branches, known as the foreground. The two variable classes are derived
from the constant classes so that sites typically evolving with ω₀ < 1
or ω₁ = 1 are allowed to be under positive selection with ω₂ > 1 on the
foreground. Testing for positive selection on the rodent clade involves
a LRT comparing a constrained version of MA (with ω₂ = 1) vs. an
unconstrained MA model. Compared to branch models, the branch-site
formulation improves the chance of detecting short spells of adaptive
pressure in the past even if these occurred at a small fraction of
sites.

Fig. 2 An estimate of ω for each branch of a six-species phylogeny. Shown is the
maximum likelihood estimate for the gene C8B. Each branch is labeled with the
corresponding estimate of ω

Returning to our example of gene C8B of the complement pathway, we
perform a branch-site LRT for positive selection using the M1a vs. M2a
comparison. We take the mouse and the rat lineages, respectively, as
foreground branches and all other branches as background branches.
Significance of the LRT is tested using the mixture ½χ²₀ + ½χ²₁, with a
critical value of 2.71 at 5%. For the C8B gene, we calculate
2(log L2 − log L1) = 2 × 2.23 = 4.46 for the mouse lineage and 11.2 for
the rat lineage, respectively.
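
To see where the 2.71 critical value comes from and how the two statistics translate into p-values, the mixture null distribution can be evaluated directly. The sketch below assumes the 50:50 mixture of a point mass at zero and a χ² distribution with 1 degree of freedom described above; the function names are illustrative, and the χ² tail is computed from the complementary error function rather than a statistics library.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def mixture_p_value(stat):
    """P-value under a 50:50 mixture of a point mass at 0 and chi-square(1)."""
    if stat <= 0.0:
        return 1.0
    return 0.5 * chi2_sf_df1(stat)


# The 5% critical value of the mixture is the 10% critical value of chi-square(1).
print(f"p at 2.71: {mixture_p_value(2.71):.3f}")

# LRT statistics reported for C8B with mouse or rat as the foreground branch.
for lineage, stat in [("mouse", 4.46), ("rat", 11.2)]:
    print(f"{lineage}: 2*dlnL = {stat}, p = {mixture_p_value(stat):.4f}")
```

Both statistics exceed 2.71, so each test is nominally significant at the 5% level before any correction for testing two foreground lineages.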

A major drawback of the described branch-site models is their reliance
on a biologically viable a priori hypothesis. In the context of
detecting sites and lineages affected by positive selection, one
possible solution is to perform multiple branch-site LRTs, each setting
a different branch as the foreground [54]. In the example of six species
(Fig. 2), a total of nine tests (for an unrooted tree) are necessary in
the absence of an a priori hypothesis. Multiple test correction has to
be applied to control excessive false inferences. This strategy tends to
be conservative but can be sufficiently powerful in detecting episodic
instances of adaptation. As with all model-based techniques, precautions
are necessary for data with unusual heterogeneity patterns, which may
cause deviations from the asymptotic null distribution and thus result
in an elevated false-positive rate.
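
As an illustration of the correction step, the sketch below applies two widely used procedures to nine branch-wise p-values. The text does not prescribe a particular procedure, so the choice of Bonferroni (family-wise error rate) and Benjamini-Hochberg (false discovery rate) is an assumption of this example, and the p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject test i if p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank i
    with p_(i) <= alpha * i / m (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# HYPOTHETICAL p-values, one per branch of a six-species unrooted tree.
p_values = [0.0004, 0.009, 0.015, 0.21, 0.35, 0.48, 0.62, 0.77, 0.91]

print("Bonferroni rejections:", sum(bonferroni(p_values)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(p_values)))
```

With these inputs the Bonferroni threshold of 0.05/9 keeps only the smallest p-value, while the step-up procedure retains three, which shows how the choice of correction trades false positives against power.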

In the case of episodic selection where any combination of branches of a
phylogeny can be affected, a Bayesian approach in lieu of the standard
LRTs and multiple testing has been suggested. The multiple LRT approach
is most concerned with controlling the false-positive rate of selection
inference and is less suited to inferring the best-fitting selection
history. In the hypothetical example (Fig. 2), a total of 2⁹ − 1 = 511
selection histories (excluding the history without selection on any
branch) need to be considered. The Bayesian analysis allows a
probability distribution over possible selection histories to be
computed and therefore permits estimates of the prevalence of positive
selection on individual branches and clades. Such an approach evaluates
uncertainty in selection histories using their posterior probabilities
and allows robust inference of interesting parameters such as the
switching probabilities for gains and losses of positive selection [42].
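
The counts used in this section follow from the number of branches in the tree, as the short sketch below shows. It assumes a fully resolved (bifurcating) unrooted tree, which has 2T − 3 branches for T species; the function names are illustrative.

```python
from itertools import product


def unrooted_branches(n_species):
    """Number of branches in a fully resolved unrooted tree."""
    return 2 * n_species - 3


def selection_histories(n_branches):
    """Yield every on/off pattern of positive selection across branches,
    excluding the pattern with no selection on any branch."""
    for pattern in product((False, True), repeat=n_branches):
        if any(pattern):
            yield pattern


n_branches = unrooted_branches(6)
extra_free_ratio_params = n_branches - 1  # 2T - 4 extra omegas vs. one-ratio
n_histories = sum(1 for _ in selection_histories(n_branches))

print("branches (one branch-site test each):", n_branches)
print("extra omega parameters in the free-ratio model:", extra_free_ratio_params)
print("selection histories:", n_histories)
```

The number of histories doubles with every added branch, which is why enumerating and testing each history separately quickly becomes impractical.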

Other models (e.g., with dS variation among sites [55]) may be extended
to allow changes of selective regimes on different branches. This is
achieved by adding further parameters, one per branch, describing the
deviation of selective pressure on a branch from the average level on
the whole tree under the site model. Such a model is parameter-rich and
can be used for exploratory purposes on data with long sequences but
does not provide a robust way of testing whether ω > 1 on a branch is
due to positive selection on a lineage or due to inaccuracy of the ML
estimation.

Kosakovsky Pond and Frost [55] suggested detecting lineage-specific
variation in selective pressure using the genetic algorithm (GA), a
computational analogue of evolution by natural selection. The GA
approach was successfully applied to phylogenetic reconstruction. In the
context of detecting lineage-specific positive selection, GA does not
require an a priori hypothesis. Instead the algorithm samples regions of
the whole hypothesis space according to their “fitness” measured by
AICc. The branch-model selection with GA may also be adapted to
incorporate dN and dS among-site variation, although this imposes a much
heavier computational burden.
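
The AICc score used as the GA's “fitness” can be computed from a model's log-likelihood, its number of parameters, and the sample size. The sketch below uses the standard small-sample formula AICc = 2k − 2 ln L + 2k(k + 1)/(n − k − 1); the candidate models, their log-likelihoods, the parameter counts, and the use of the number of codons as the sample size are all hypothetical choices made for illustration, not values from the study cited above.

```python
def aicc(log_likelihood, n_params, n_obs):
    """Small-sample corrected Akaike information criterion (lower is better)."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("AICc needs n_obs > n_params + 1")
    aic = 2.0 * n_params - 2.0 * log_likelihood
    return aic + (2.0 * n_params * (n_params + 1)) / (n_obs - n_params - 1)


# HYPOTHETICAL candidates: (label, log-likelihood, number of parameters).
candidates = [
    ("one omega for all branches", -5120.4, 11),
    ("two branch classes", -5112.9, 12),
    ("three branch classes", -5112.1, 13),
]
n_codons = 400  # HYPOTHETICAL alignment length, used here as the sample size

scores = {label: aicc(lnl, k, n_codons) for label, lnl, k in candidates}
best = min(scores, key=scores.get)

for label, score in scores.items():
    print(f"{label}: AICc = {score:.2f}")
print("best-scoring model:", best)
```

In this toy comparison the third ω class improves the likelihood too little to pay for its extra parameter, so the two-class model scores best. A GA explores many such branch partitions instead of a fixed list.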

In branch and branch-site models, a change in selection regime is always
associated with nodes of a tree, but the selective pressure remains
constant over the length of each branch. Guindon et al. [56] proposed a
Markov-modulated model where switches of selection regimes may occur at
any site and any time on the phylogeny. In a covarion-like manner, this
codon model combines two Markov processes: one governs the codon
substitution, while the other specifies rates of switches between
selective regimes. These models can be used to study the patterns of the
changes in selective pressures over time and across sites, by estimating
the relative rates of changes between different selective regimes
(purifying, neutral, and positive).


3.3.7 Polymorphism-Aware Phylogenetic Models

Polymorphism-aware phylogenetic models (PoMos, [57, 58]) use
polymorphism and divergence data simultaneously to estimate relative
mutation rates and scaled selection coefficients. Similar to DNA
substitution models, the PoMo approach is based on a continuous-time
Markov process to model evolution of hereditary sequences along a
species tree. However, it considers not only the evolution of a single
reference site but rather the evolution of a population.

PoMo includes polymorphisms as states of the Markov chain, in addition
to the four nucleotide states of classical nucleotide models. Sequence
evolution is modeled as a gradual process made up of small allele
frequency changes. PoMo accounts for ancestral polymorphisms and in
particular for ancestral shared polymorphisms and incomplete lineage
sorting (when two speciation events are separated by a lapse of time not
sufficient for polymorphisms to reach fixation, see Maddison and Knowles
[59]). The parameters in PoMo do not merely describe substitution rates
but are also informative of mutation rates, fixation biases, root
nucleotide frequencies, and branch lengths. All these parameters are
estimated within a ML framework. De Maio et al. [57] performed a
comprehensive study of evolutionary patterns of fourfold-degenerate
sites in great ape populations. They show evidence in favor of variation
in mutation and fixation rates between genomic regions with different
base composition, contributing to the long-standing debate regarding the
origin and maintenance of GC content variation (e.g., see Eyre-Walker
and Hurst [60]). They found that both mutation rates and biased gene
conversion vary with GC content. They also found lineage-specific
differences, with weaker fixation biases in orangutan species,
suggesting a reduced historical effective population size. As PoMo can
distinguish between the contributions of mutation and fixation biases,
it might also contribute to addressing the problem of disentangling
signatures of selection and biased gene conversion (see Subheading 4.2).


3.4 Software

The software PHAST (PHylogenetic Analysis with Space/Time models)
includes several phylo-HMM-based programs. Two programs in PHAST are
particularly interesting in the context of selection studies. PhastCons
is a program for conservation scoring and identification of conserved
elements (Siepel et al. [61]). PhyloP is designed to compute p-values
for conservation or acceleration, either lineage-specific or across all
branches (Pollard et al. [62]). The software can now also be run through
a web portal at http://compgen.cshl.edu/phastweb/.

A variety of codon models to detect selection, including branch-site
models and the recent selection-mutation model, are implemented in the
CODEML program of PAML [63]. HYPHY is another implementation that
includes a large variety of codon models [64]. PoMo has been implemented
as part of the IQ-TREE software package (http://www.iqtree.org/) by
Schrempf et al. [65].

These programs are primarily developed for maximum likelihood inference
on a fixed tree. ML inference of phylogeny under codon models is
possible with CodonPhyML, which allows selection on the protein level to
be accounted for explicitly [66].


4 Notes/Discussion

With the wider use of codon models to detect selection, some questioned
the statistical basis of testing based on branch-site models. In 2004,
Zhang found that the original branch-site test [67] produced excessive
false positives when its assumptions were not met. The modified
branch-site test was shown to be more robust to model violations (see
[43, 68]) and is now commonly used in genome-wide selection scans (e.g.,
see [69]). More recently, however, another simulation study by Nozawa et
al. [70] suggested that this modification also showed an excess of false
positives. Yang and Dos Reis [52] defended the branch-site test by
examining the null distribution and showing that Nozawa and colleagues
[70] misinterpreted their simulation results. However, it is clear that
even tests with good statistical properties will be affected by data
quality and the extent of model violations. Below we list factors that
can affect the test and so should be taken into account when analyzing
genome-wide data.


4.1 Quality of Multiple Alignments

The impact of the quality of sequence and the alignment is a major 
concern when performing positive selection scans. For example, in 
their analysis of 12 genomes Markova-Raina and Petrov [71] found 
that the results were highly sensitive to the choice of an alignment 
method. Furthermore, visual analysis indicated that most sites 
inferred as positively selected are in fact misaligned at the codon 
level. The rate of false positives was ~50% or more, depending
on the aligner used. Some of these results can be ascribed to the
high divergence level of the 12 Drosophila species and could be 
addressed by better filtering of the data. Nevertheless, even in 
mammals where alignment is easier, problems have been observed. 

Bakewell et al. [72] used the branch-site test to analyze ~14,000 genes
from the human, chimpanzee, and macaque and detected more genes to be
under positive selection on the chimpanzee lineage than on the human
lineage (233 vs. 154). The same pattern was also observed by Arbiza et
al. [73] and Gibbs et al. [74]. Mallick et al. [75] re-examined 59 genes
detected to be under positive selection on the chimpanzee lineage by
Bakewell et al. [72], using more stringent filters to remove less
reliable nucleotides and using synteny information to remove
misassembled and misaligned regions. They found that with improved data
quality, the signal of positive selection disappeared in most of the
cases when the branch-site test was applied. It now appears that, as
suggested by Mallick et al. [75], the earlier discovery of more frequent
positive selection on the chimpanzee lineage than on the human lineage
is an artifact of the poorer quality of the chimpanzee genomic sequence.
This interpretation is also consistent with a few recent studies
analyzing both real and simulated data, which suggest that sequence and
alignment errors may cause excessive false positives (see [76, 77]).
Indeed, most commonly used alignment programs tend to place
nonhomologous codons or amino acids into the same column (see [78, 79]),
generating the wrong impression that multiple nonsynonymous
substitutions occurred at the same site and misleading the codon models
into detecting positive selection [77]. In 2012, Jordan and Goldman [80]
investigated the effect of various multiple alignment and filtering
programs on the identification of positive selection. They found that
the alignment software PRANK [79] and the filter Guidance [81] performed
best in simulations. However, it remains very challenging to develop a
pipeline to detect positive selection that is robust to errors in the
sequences or alignments. Instead we advise carefully checking the
alignments of genes that are putatively under selection by any method
described here.


4.2 Biased Gene Conversion and Recombination

Mutation rate variation can also cause genomic regions to have 
different substitution rates without any change in fixation rate. 
Recent studies of guanine and cytosine (GC)-isochores in the 
mammalian genome have suggested the importance of another 
selectively neutral evolutionary process that affects nucleotide evo- 
lution. As described in the work of Laurent Duret and others (see 
[82, 83]), biased gene conversion (BGC) is a mechanism caused by 
the mutagenic effects of recombination combined with the prefer- 
ence in recombination-associated DNA repair toward strong 
(GC) versus weak (adenine and thymine [AT]) nucleotide pairs at 
non-Watson-Crick heterozygous sites in heteroduplex DNA during 
crossover in meiosis. Thus, beginning with random mutations, 
BGC results in an increased probability of fixation of G and C 
alleles. In particular, neither methods looking for accelerated regions
in coding DNA nor codon models can distinguish positive
selection from BGC (see [84, 85]). Therefore, the putatively
selected genes should be checked for GC content and closeness to 
recombination hotspots and telomeres. 
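
A GC content check of this kind is straightforward to script. The sketch below computes the overall GC fraction of a coding sequence and, as an additional illustrative statistic not discussed in the text, the GC fraction at third codon positions; the function names and the example sequence are hypothetical, and no threshold for a "suspicious" value is implied.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T) in a sequence."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def gc3_content(cds):
    """GC fraction at third codon positions of an in-frame coding sequence."""
    return gc_content(cds[2::3])


# HYPOTHETICAL in-frame coding sequence (four codons).
cds = "ATGGCCAAATTT"

print(f"GC  = {gc_content(cds):.3f}")
print(f"GC3 = {gc3_content(cds):.3f}")
```

Comparing such values for candidate genes against the genome-wide distribution, together with their distance to known recombination hotspots, gives a first indication of whether BGC rather than selection might explain the signal.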

Most codon models assume a single phylogeny and a constant synonymous
rate among sites, implying that rate variation among codons is solely
due to the variation of the nonsynonymous rate. Recent studies question
whether such assumptions are generally realistic (e.g., see [86]),
suggesting that failure to account for synonymous rate variation may be
one of the reasons why LRTs for positive selection are vulnerable on
data with high recombination rates. Some selection scans try to control
this problem by checking putatively selected genes for recombination
either manually or automatically with traditional detection software
(e.g., RDP [87]). Drummond and Suchard [88] have also developed a
Bayesian approach to detect recombination within a gene.

Another approach is to explicitly consider recombination. For example,
Scheffler, Martin, and Seoighe [89] extended codon models with both dN
and dS site variation and allowed changes of topology at the detected
recombination breakpoints. Certainly, fast-evolving pathogens (such as
viruses) undergo frequent recombination, which often changes either the
whole shape of the underlying tree or only the apparent branch lengths.
While the efficiency of the approach depends on the success of inferring
recombination breakpoints, the study demonstrated that taking into
account alternative topologies achieves a substantial decrease of
false-positive inferences of selection while maintaining reasonable
power. In principle the correlation structure of a collection of
orthologous sequences can be fully described by a network known as an
ancestral recombination graph (ARG). However, methods for ARG inference
have not been fast enough for practical use, and for applications on
large-scale genomic data, approximations are necessary (Rasmussen et al.
[90]).


4.3 Selection on Synonymous Sites

Most selection studies to date focused on detecting selection on the 
protein, since synonymous changes are often presumed neutral and 
so unaffected by selective pressures. However, selection on synonymous
sites was first documented more than a decade ago. Codon
usage bias is known to affect the majority of genes and species. In 
his seminal work, Akashi [91] demonstrated purifying selection on 
genes of Drosophila melanogaster, where strong codon bias favoring 
certain (optimal) codons serves to increase the translational accu- 
racy. Pressure to optimize for translational efficiency, robustness, 
and kinetics leads to synonymous codon bias, which was shown to 
widely affect mammalian genes [92], as well as genes of fast- 
evolving pathogens like viruses [93]. The standard approach to 
study selection on codon usage computes various codon adaptation 
indexes on full-length protein-coding genes (see [94] for review). 
More recently, methods to study selection on synonymous changes have
adopted more sophisticated approaches, mainly the following strategies:
(1) account for synonymous rate variation within sequences;
(2) include codon fitness parameters within a modeling framework that
connects population and interspecific parameters; and (3) allow for
selection on synonymous substitutions by introducing the dependency on
the rate of protein production and nonsense error rates. Below we
elaborate on these approaches.

In the past decade, evidence has accumulated to suggest that 
codon bias may vary not only between genomes and genes of the 
same genome but also within genes. Rather than just measuring 
codon biases in single sequences, a more powerful approach is to 
model evolution and selection across a set of homologous 
sequences. Taking the evolutionary perspective into account, 
Resch et al. [95] conducted a large-scale study of selection on 
synonymous sites in mammalian genes. They measured selection
by comparing the average rate of synonymous substitutions (dS) to
the average substitution rate in the corresponding introns (dI).
While purifying selection was found to affect 28% of genes
(dS/dI < 1), 12% of genes were found to have been affected by positive
selection on synonymous sites (dS/dI > 1). The signal of positive
selection correlated with lower predicted mRNA stability compared 
to genes with negative selection on synonymous sites, suggesting 
that mRNA destabilization (affecting mRNA levels and translation) 
could be driving positive selection on synonymous sites. 

An increasing number of experimental studies exemplify differ- 
ent scenarios explaining how synonymous mutation may be 
affected by positive or negative selection. Codon bias to match 
skews of tRNA abundances may influence translation [96]. Changes
at silent sites can disrupt splicing control elements and create new
“cryptic” splice sites, and mRNA and transcript stability can be
affected through preference or avoidance of certain sequence
motifs (see [92, 97]). Silent changes may affect gene regulation
via constraints for efficient binding of miRNA to sense mRNA 
(e.g., [92, 98]). Selection may act on the choice of synonymous 
codons near miRNA targets, improving the binding site accessibil- 
ity, binding efficiency and consequently the function of miRNA 
itself [99]. Programmed ribosomal frameshifting may be another 
reason for selection to act on specific codon positions [100]. Speed- 
dependent protein folding also has been proposed to be a result of 
selective pressure [101]. According to the co-translational protein 
folding hypothesis, slower production could cause the protein to 
take an altered final form (as has been shown in multidrug 
resistance-1, [102]). Finally, synonymous changes may act to mod- 
ulate expression by altering mRNA secondary structure, affecting 
protein abundance [103]. 

Models of codon evolution currently provide the most power- 
ful approach for studying selection on silent sites. Models with 
variable synonymous rates (see [64, 104]) have been used to evaluate
the extent of variability of synonymous rates in a gene and to
predict specific sites with the most extreme (low or high) synonymous
rates (for example, see [93]). A large-scale study of synonymous
rate variation [105] described some intriguing general patterns and
showed that the phenomenon is widespread in protein-coding genes.
Genes displaying significantly varying synonymous rates showed
increased association with several genetic diseases
(especially cancers and diabetes) and were enriched for metabolic
pathways. Other studies specifically focusing on human oncogenes 
revealed that a significant proportion of all cancer driver mutations 
were synonymous [106]. This suggests that synonymous changes cannot be
automatically assumed to be fitness-neutral. Note that ω = dN/dS, an
accepted measure of selection on the protein, is not designed to detect
selection on synonymous codons, particularly when dS is assumed
constant. Yet, some cautioned that low synonymous rates preserved by
purifying selection might erroneously lead to the detection of positive
selection on the protein (e.g., Rubinstein et al. [107]). However, the
usage of the ω ratio does not rely on the assumption that synonymous
sites are neutral (pages 58-59 of Yang [108]; and Section 6.3 of
Anisimova and Liberles [109]); rather, it is defined as a ratio of two
ratios, comparing the proportions of nonsynonymous and synonymous sites
after and before selection has operated on the protein (ω = 1). In
general we can assume that the evolutionary forces apply equally to
synonymous and nonsynonymous sites. Forces that act differentially on
synonymous and nonsynonymous sites should be rare in real data, but they
can affect the validity of the ω measure. The only known example of such
a natural force is probably synonymous phasing, considered by Xing and
Lee [110]. But even in this case, and with a worst-case scenario, the
estimated effect is very weak. More crucially, an adequate description
of mutational processes at the DNA level allows biases in the estimation
of the ω ratio to be circumvented [106].

Further testing, however, is necessary to decide whether any 
specific site has been affected by selection on synonymous codon 
usage. For example, Zhou, Gu, and Wilke [111] suggested distinguishing
two types of synonymous substitution rates: the rate of conserving
synonymous changes dSC (between “preferred” codons or between “rare”
codons) and the rate of non-conserving synonymous changes dSN (between
codons from the two different groups “rare” and “preferred”). Silent
sites with dSN/dSC > 1 may be considered to be under positive
selection, and significance can be
tested based on a likelihood ratio test. Alternatively, synonymous
rates at sites may be compared to the mean substitution rate in the 
corresponding intron, which can be implemented in a joint codon 
and DNA model, similar to the approach proposed by Wong and 
Nielsen [112]. 

Mutation-selection models include selective and mutational 
effects separately and allow estimating the fitness of various codon 
changes (see [113-115]). The relative rate of substitution for 
selected mutations to neutral mutations is given by 
$\omega = 2\gamma/(1 - e^{-2\gamma})$, where $\gamma = 2Ns$ is the scaled selection 
coefficient (see Exercise 3 for a derivation). Nielsen et al. [114] 
assumed that all changes between preferred and rare codons have 
the same fitness (and so the same selection coefficient). They used 
one selection coefficient for optimal codon usage for each branch of 
a phylogeny and estimated these jointly with the $\omega$ ratio by ML. 
Using this 
approach to study ancestral codon usage bias, Nielsen et al. [114] 
confirmed the reduction in selection for optimal codon usage in 
D. melanogaster. In contrast, Yang and Nielsen [115] estimated 
individual codon fitness parameters and used them to estimate 
optimal codon frequencies for a gene across multiple species. A 
likelihood ratio test (LRT) is used to test whether the codon bias is 
due to the mutational bias alone. A remarkable contribution of the 
mutation-selection models is the connection they make between 
the interspecific and population parameters. Exploiting this further 
should provide insights into how changing demographic factors 
influence observed intraspecific patterns. Mutation-selection models also 
allow a new perspective on understanding codon models in the 
context of fitness landscapes with statistical implications as 
discussed in Subheading 4.2 of Chapter 13 by Jones, Susko, and 
Bielawski. 
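
The relationship between the scaled selection coefficient and the rate ratio quoted above can be explored numerically. The following is a minimal sketch, not taken from the cited papers: the function name `omega_from_gamma` and the small-value cutoff are illustrative assumptions, and the formula is the one given in the text with gamma = 2Ns.

```python
import math


def omega_from_gamma(gamma: float) -> float:
    """Relative substitution rate omega = 2*gamma / (1 - exp(-2*gamma)).

    gamma is the scaled selection coefficient (gamma = 2Ns in the text).
    The cutoff below is an illustrative choice to avoid 0/0 at gamma = 0,
    where the expression tends to the neutral value omega = 1.
    """
    if abs(gamma) < 1e-8:
        return 1.0
    # -expm1(-2*gamma) equals 1 - exp(-2*gamma), computed stably
    return 2.0 * gamma / -math.expm1(-2.0 * gamma)


# Deleterious, neutral, and advantageous changes (hypothetical gamma values)
for g in (-2.0, 0.0, 2.0):
    print(f"gamma = {g:+.1f}  omega = {omega_from_gamma(g):.4f}")
```

Under this formula, negative gamma gives omega below one and positive gamma gives omega above one, which is the qualitative behavior the mutation-selection models rely on.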

Finally, it is also possible to study selection on synonymous 
changes by introducing a parametric relationship between fitness 
and protein production cost. The idea was first described by Gilchr- 
ist [116], who assumed that, in addition to mutation and drift, the 
codon bias evolved under selection to reduce the cost of nonsense 
errors. Protein production cost can be computed as a ratio of the 
expected cost to the expected benefit [117]. Kubatko and collea- 
gues [118] have extended a standard codon model to include the 
difference in protein production due to the usage of different 
codons (and therefore different elongation probabilities). How- 
ever, such a model requires position-specific instantaneous rate 
matrices, and consequently also the probability transition matrices, 
making the approach computationally very intensive. To circum- 
vent this, a GPU-based implementation was developed and used for 
phylogeny inference from a 104-gene data set from Saccharomyces 
cerevisiae. Based on the standard model selection measure AIC, the 
new model outperformed the simplest model M0 as well as the 
mutation-selection model FMutSel of Yang and Nielsen. 
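
For readers unfamiliar with AIC-based model comparison, the sketch below shows the calculation in its standard form, AIC = 2k - 2 lnL, where k is the number of free parameters. The model names, log-likelihoods, and parameter counts are hypothetical placeholders, not results from the study described above.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


# Hypothetical fits: (maximum log-likelihood, number of free parameters)
fits = {
    "model_A": (-10250.0, 12),
    "model_B": (-10190.0, 20),
}

scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: AIC = {score:.1f}")
print("preferred by AIC:", best)
```

The penalty term 2k is what allows a richer model to be rejected when its gain in log-likelihood is too small to justify the extra parameters.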


Q1. Amino Acid and Codon Substitution Models 


How many parameters need to be estimated in the instantaneous 
rate matrix Q defining a reversible empirical AA model? How many 
such parameters are necessary to estimate for a reversible empirical 
codon model? How many parameters are to be estimated in both 
cases if a model is nonreversible? 




Carolin Kosiol and Maria Anisimova 


Q2. Positive Selection Scans 


1. Go to the UCSC genome browser (http://genome.ucsc.edu). 
Search for the HAVCR1 (hepatitis A virus cellular receptor 1) 
gene in the human genome (assembly GRCh38/hg38) belonging 
to the mammalian clade. The UCSC genome browser tracks 
provide a summary of previous analyses of coding regions. 
Switch the “Cons_30_Primates” track under “Comparative Geno- 
mics” to full and “refresh.” Why are only a few bases in the 
HAVCR1 gene conserved according to the PhastCons track? 
Click on the “Cons_30_Primates” track to learn more about 
the conservation scores used. 


2. To retrieve the multiple sequence alignments for the HAVCR1 
gene, go to “Tools” and “Table Browser” at the top bar of the 
webpage. This will open a new page. Choose the table 
“ccdsGene” under the “Genes and Gene Predictions” group 
and “CCDS” track. Select the “CDS FASTA alignment from mul- 
tiple alignment” option in the output format and “Show 
nucleotides” to download the aligned coding sequences of 
the HAVCR1 gene. Alternatively, you can retrieve the multiple 
alignments from Ensembl using BioMart. Here, you have 
options for more file formats, including PHYLIP, which is needed 
for the PAML software. 


3. Use the PAML software 
(http://abacus.gene.ucl.ac.uk/software/paml.html) to test for 
positive selection on any lineage of the mammalian trees by 
comparing models M1a and M2a with a likelihood ratio test. 


4. Use PAML to identify sites under positive selection by using 
the Bayes Empirical Bayes approach. Do you find the same sites 
to be under selection as in Fig. 2 of Kosiol et al. [43]? 
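
The likelihood ratio test in step 3 can be computed by hand once the two maximum log-likelihoods have been read from the PAML output. The sketch below assumes the common practice of comparing twice the log-likelihood difference to a chi-square distribution with 2 degrees of freedom (M2a adds two free parameters to M1a); the log-likelihood values are hypothetical. For 2 degrees of freedom the chi-square tail probability has the closed form exp(-x/2), so no statistics library is needed.

```python
import math


def lrt_m1a_vs_m2a(lnl_m1a: float, lnl_m2a: float) -> tuple[float, float]:
    """Return (test statistic, p-value) for the M1a vs M2a contrast.

    Assumes a chi-square reference distribution with df = 2, for which
    the survival function is exactly exp(-x / 2).
    """
    stat = max(0.0, 2.0 * (lnl_m2a - lnl_m1a))
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Hypothetical log-likelihoods copied from two PAML runs
stat, p = lrt_m1a_vs_m2a(lnl_m1a=-2345.67, lnl_m2a=-2339.12)
print(f"2*delta(lnL) = {stat:.2f}, p = {p:.4g}")
```

A small p-value is evidence for a class of sites with a rate ratio above one somewhere in the gene; identifying which sites are involved is the separate Bayes Empirical Bayes step in item 4.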


Q3. Selection-Mutation Models 


Selection-mutation models rely on a theoretical relationship between 
the nonsynonymous-synonymous rate ratio $\omega$ and the scaled selec- 
tion coefficient $\gamma = 2Ns$. The probability that a new mutation 
eventually becomes fixed is 


$$\Pr(\text{fixation}) = \frac{1 - e^{-2s}}{1 - e^{-4Ns}} \approx \frac{2s}{1 - e^{-4Ns}}$$


if we assume that the selection coefficient s is small and N is large 
and represents the effective population size, which is constant in 
time (Kimura and Ohta [119]). Furthermore, assume that synony- 
mous substitutions are neutral and nonsynonymous substitutions 
have equal (and small) selection coefficients. Derive the relationship: 


$$\omega = \frac{4Ns}{1 - e^{-4Ns}} = \frac{2\gamma}{1 - e^{-2\gamma}}$$


that combines phylogenetic with population genetic quantities and 
is crucial for mutation-selection models. 
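
As a hint, one possible route to the result is sketched below. It rests on assumptions that are not stated in the exercise: a diploid population with 2N gene copies, and a common per-copy mutation rate (written here as $\mu$) for synonymous and nonsynonymous changes, so that it cancels in the ratio. The symbols $\mu$, $k_S$, and $k_N$ are introduced only for this illustration.

```latex
\begin{align*}
k_S &= 2N\mu \cdot \frac{1}{2N} = \mu
  && \text{(neutral: fixation probability } 1/(2N)\text{)} \\
k_N &= 2N\mu \cdot \frac{2s}{1 - e^{-4Ns}}
  && \text{(selected: fixation probability given in the exercise)} \\
\omega &= \frac{k_N}{k_S}
        = \frac{4Ns}{1 - e^{-4Ns}}
        = \frac{2\gamma}{1 - e^{-2\gamma}},
  && \gamma = 2Ns .
\end{align*}
```

In each line the substitution rate is the number of new mutations arising per generation multiplied by the probability that any one of them is fixed. As s approaches zero the ratio tends to one, recovering the neutral case.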
Model and Data 
Abstract 


Codon substitution models (CSMs) are commonly used to infer the history of natural selection for a set of 
protein-coding sequences, often with the explicit goal of detecting the signature of positive Darwinian 
selection. However, the validity and success of CSMs used in conjunction with the maximum likelihood 
(ML) framework is sometimes challenged with claims that the approach might too often support false 
conclusions. In this chapter, we use a case study approach to identify four legitimate statistical difficulties 
associated with inference of evolutionary events using CSMs. These include: (1) model misspecification, 
(2) low information content, (3) the confounding of processes, and (4) phenomenological load, or 
PL. While past criticisms of CSMs can be connected to these issues, the historical critiques were often 
misdirected, or overstated, because they failed to recognize that the success of any model-based approach 
depends on the relationship between model and data. Here, we explore this relationship and provide a 
candid assessment of the limitations of CSMs to extract historical information from extant sequences. To 
aid in this assessment, we provide a brief overview of: (1) a more realistic way of thinking about the process 
of codon evolution framed in terms of population genetic parameters, and (2) a novel presentation of the 
ML statistical framework. We then divide the development of CSMs into two broad phases of scientific 
activity and show that the latter phase is characterized by increases in model complexity that can sometimes 
negatively impact inference of evolutionary mechanisms. Such problems are not yet widely appreciated by 
the users of CSMs. These problems can be avoided by using a model that is appropriate for the data; but, 
understanding the relationship between the data and a fitted model is a difficult task. We argue that the only 
way to properly understand that relationship is to perform in silico experiments using a generating process 
that can mimic the data as closely as possible. The mutation-selection modeling framework (MutSel) is 
presented as the basis of such a generating process. We contend that if complex CSMs continue to be 
developed for testing explicit mechanistic hypotheses, then additional analyses such as those described 
here (e.g., penalized LRTs and estimation of PL) will need to be applied alongside the more traditional 
inferential methods. 


Key words Codon substitution model, dN/dS, False positives, Maximum likelihood, Mechanistic 
model, Model misspecification, Mutation-selection model, Parameter confounding, Phenomenologi- 
cal load, Phenomenological model, Positive selection, Reliability, Statistical inference, Site-specific 
fitness landscape 
1 Introduction 


Codon substitution models (CSMs) fitted to an alignment of 
homologous protein-coding genes are commonly used to make 
inferences about evolutionary processes at the molecular level (see 
Chapter 10 for examples of different applications of CSMs). Such 
processes (e.g., mutation and selection) are represented by a vector 
of parameters $\theta$ that can be estimated using maximum likelihood 
(ML) or Bayesian statistical methods. Here, we focus on ML and 
for convenience use CSM to indicate a model that is used in 
conjunction with the ML approach (see [21], for an example of 
the Bayesian approach). Considerable apprehension was expressed 
about the statistical validity of CSMs during their initial phase of 
development. Of particular concern was the risk of falsely 
inferring that a sequence or codon site evolved by adaptive evolu- 
tion [11, 22, 23, 46, 60-63, 85]. Many of the studies employed in 
the critique of CSMs were later shown to be flawed due to statistical 
errors or incorrect interpretation of results [70, 72, 77, 84]. In 
their reanalysis of the iconic MHC dataset [24], for example, 
Suzuki and Nei [61] based their criticism of the ML approach on 
results that were incorrect due to computational issues [70]. And in 
simulation studies by Suzuki [60] and Nozawa et al. [46], the 
branch-site model of Yang and Nielsen [79] was criticized as 
being too liberal because it falsely inferred positive selection at 
32 out of 14,000 simulated sites, even though this rate (0.0023) 
was well below the level of significance of the test ($\alpha = 0.05$) 
[77]. Concerns about the ML approach were eventually mollified 
by numerous simulation studies showing that the false positive rate 
is no greater than the specified level of significance of the LRT 
under a wide range of evolutionary scenarios [2, 3, 29, 31, 37, 70, 
77, 82, 85, 86]. The validity and success of the approach is now well 
established [84], and this has led to the formulation of CSMs of 
ever-increasing sophistication [31, 41, 48-50, 55, 64, 65]. 
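
The arithmetic behind the rebuttal of the "too liberal" criticism is easy to verify. The short sketch below uses only the figures quoted in the paragraph above (32 false positives among 14,000 simulated sites, tested at a significance level of 0.05); the variable names are illustrative.

```python
false_positives = 32
simulated_sites = 14_000
alpha = 0.05  # nominal significance level of the test

# Observed false positive rate reported in the text (about 0.0023)
observed_rate = false_positives / simulated_sites

# Number of false positives that a test operating exactly at its
# nominal level would be expected to produce
expected_at_alpha = alpha * simulated_sites

print(f"observed rate: {observed_rate:.4f}")
print(f"false positives expected at alpha: {expected_at_alpha:.0f}")
```

The observed count is far below the number a test operating at its nominal level would produce, which is why the simulated results point to a conservative test rather than a liberal one.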

The most common use of a CSM is to infer whether a given 
process, such as adaptive evolution somewhere in the gene, the 
fixation of double and triple mutations, or variations in the synon- 
ymous substitution rate, actually occurred when the alignment was 
generated. Several factors can potentially undermine the reliability 
of such inferences. These include: 


1. Model misspecification, which can result in biased parameter 
estimates; 


2. Low information content, which can cause parameter esti- 
mates to have large sampling errors and can lead to excessive 
false positive rates; 
3. Confounding, which can cause patterns in the data generated 
by one evolutionary process to be attributed to a different 
process; 


4. Phenomenological load, which can cause a model parameter 
to be statistically significant even if the process it represents did 
not actually occur when the data was generated. 


These same factors can impact any model-based effort to make 
inferences from data generated by complex biological processes, 
not only the CSMs described here. The possibility of false 
inference due to any combination of these factors does not imply 
that the CSM approach is unreliable in principle. As has been 
demonstrated by numerous successful applications, CSMs generally 
extract accurate and useful information provided that the model is 
well suited for the data at hand [1, 71, 76]. We maintain that the 
validity of inferences is not a function of the model in and of itself, 
but is a consequence of the relationship between the model and 
the data. 

Here, we explore this relationship via case studies taken from 
the historical development of CSMs. Our objective is to be candid 
about the limitations of CSMs to reliably extract information from 
an alignment. But, we emphasize that the impact of these limita- 
tions (i.e., false positives and confounding) is a consequence of a 
mismatch between the parameters included in the model and the 
often limited information contained in the alignment. The case 
studies are divided into two parts, each corresponding to a distinct 
phase in the development of CSMs. Phase I is characterized by 
pioneering efforts to formulate CSMs to account for the most 
prominent components of variation in an alignment 
[16, 42]. These include the M-series models that were among the 
first CSMs to account for variations in selection effects across sites 
[81], and the branch-site model of Yang and Nielsen [79] (hereaf- 
ter, YN-BSM) formulated to account for variations in selection 
effects across both sites and branches. The first pair of case studies 
exemplifies concerns about the impact of low information content 
(Case Study A) and model misspecification (Case Study B) on the 
probability of falsely detecting positive selection in a gene or at a 
particular codon site. We also include a description of methods 
recently developed to mitigate the problem of false inference. 

Phase II in the historical development is characterized by the 
general increase in the complexity of CSMs aimed at accounting for 
more subtle components of variation in an alignment.¹ Models 
used to detect temporal changes in site-specific selection effects 
(e.g., [18, 31, 55]) or “heterotachy” [36] are representative. The 
movement toward complex parameter-rich models has resulted in a 
new set of concerns that are not yet widely appreciated. Principal 
among these is an increase in the possibility of confounding. Two 
components of the alignment-generating process are confounded if 
they can produce the same or similar patterns in the data. Such 
components can be impossible to disentangle without the input of 
further biological information, and their existence can lead to a 
statistical pathology that we call phenomenological load (PL). The 
second pair of case studies illustrates the possibility of false infer- 
ence due to confounding (Case Study C) and PL (Case Study D). 
An essential feature of these studies is the use of a much more 
realistic generating model to produce alignments for the purpose 
of model evaluation. 


¹ The original CSM proposed by Goldman and Yang [16] was in fact quite complex in that it adjusted substitution 
rates between nonsynonymous codons to account for differences in physicochemical properties using the 
Grantham matrix [17]. This approach was later abandoned in favor of the simpler formulation now known as 
M0 [44], i.e., the first M-series model [81]. 

Recent discoveries made using the mutation-selection (MutSel; 
[80]) framework of Halpern and Bruno [19], which is based on a 
realistic approximation of population dynamics at individual codon 
sites, have challenged the way we think about the relationship 
between parameters of traditional CSMs and components of the 
process of molecular evolution they are meant to summarize (e.g., 
[25, 26, 56, 57]). Previously, there has been a tendency to think 
about alignment-generating processes as if they occur in the same 
way they are modeled by a CSM. This way of thinking can be 
misleading because mechanisms of protein evolution can differ in 
important and substantial ways from traditional CSMs. To redress 
this issue, we begin this chapter with a brief overview of the 
conceptual foundations of MutSel as a more realistic way of think- 
ing about the actual process of molecular evolution. This material is 
followed by a novel presentation of the ML statistical framework 
intended to illustrate potential limitations in what can reasonably 
be inferred when a CSM is fitted to data. 


2 Conceptual Foundations 


2.1 How Should We Think About the Alignment-Generating Process? 


A codon substitution model represents an attempt to explain the 
way a target protein-coding gene changed over time by a combina- 
tion of mutation, selection (purifying as well as adaptive), and drift. 
Adaptive evolution occurs at each site within a protein in response 
to a hierarchy of effects, including, but not limited to, changes in 
the network of the protein’s interactions, changes in the functional 
properties of that network, and changes in both the cellular and 
organismal environment over time. The result of the complex 
interplay between these effects is typically viewed through the 
narrow lens of an alignment of homologous sequences X obtained 
from extant species, possibly accompanied by a tree topology t (for 
our purposes, it is always assumed that t is known). The informa- 
tion contained in X is evidently insufficient to resolve all of the 
effects of the true generating process, which would in any case be 
difficult or even impossible to parameterize with any accuracy. It is 
therefore necessary to base the formulation of a CSM on a number 
of simplifying assumptions. The usual assumptions include that: 


1. Sites evolved independently; 


2. Each site evolved via a homogeneous substitution process over 
the tree (formally, by a Markov process governed by a substitu- 
tion rate matrix $Q$); 


3. The selection regime at a site is determined by $Q$ drawn from a 
small set of possible substitution rate matrices $\{Q_1, \ldots, Q_k\}$; 


4. All sites share a common vector of stationary frequencies and 
evolved via a common mutation process. 


The elements $q_{ij}$ of a substitution rate matrix $Q$ are typically defined 
for codons $i \neq j$ as follows [44]: 


$$q_{ij} = \begin{cases}
0 & \text{if } i \text{ and } j \text{ differ by more than one nucleotide} \\
\pi_j & \text{for synonymous transversions} \\
\kappa\pi_j & \text{for synonymous transitions} \\
\omega\pi_j & \text{for nonsynonymous transversions} \\
\omega\kappa\pi_j & \text{for nonsynonymous transitions}
\end{cases} \qquad (1)$$


where $\kappa$ is the transition bias and $\pi_j$ is the stationary frequency of 
the $j$th codon, both assumed to be the same for all codon sites. The 
ratio $\omega = dN/dS$ of the nonsynonymous substitution rate $dN$ to 
the synonymous substitution rate $dS$ (both adjusted for “opportu- 
nity”²) quantifies the stringency of selection at the site, with values 
closer to zero corresponding to sites that are more strongly con- 
served. We follow standard notation and use $\hat{\omega}$ to represent the 
maximum likelihood estimate (MLE) of $\omega$ obtained by fitting Eq. 1 
to an alignment. 
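
To make Eq. 1 concrete, the following sketch builds the 61 x 61 rate matrix for the standard genetic code. It is an illustration rather than the implementation used in any particular software package: the function name `m0_rate_matrix`, the default of equal codon frequencies, and the decision to leave the matrix unscaled are assumptions made here. The diagonal is set to the negative row sum, which is the usual convention for a rate matrix but is not part of Eq. 1 itself.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
CODONS = [c for c in CODE if CODE[c] != "*"]  # the 61 sense codons
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def m0_rate_matrix(kappa, omega, pi=None):
    """Unscaled rate matrix following Eq. 1.

    kappa: transition bias; omega: nonsynonymous/synonymous rate ratio;
    pi: stationary codon frequencies (equal frequencies if omitted).
    """
    n = len(CODONS)
    pi = pi or [1.0 / n] * n
    Q = [[0.0] * n for _ in range(n)]
    for i, ci in enumerate(CODONS):
        for j, cj in enumerate(CODONS):
            if i == j:
                continue
            diffs = [(a, b) for a, b in zip(ci, cj) if a != b]
            if len(diffs) != 1:
                continue  # more than one nucleotide difference: rate 0
            rate = pi[j]
            if frozenset(diffs[0]) in TRANSITIONS:
                rate *= kappa  # transition
            if CODE[ci] != CODE[cj]:
                rate *= omega  # nonsynonymous change
            Q[i][j] = rate
        Q[i][i] = -sum(Q[i])  # diagonal makes each row sum to zero
    return Q


Q = m0_rate_matrix(kappa=2.0, omega=0.5)
ttt = CODONS.index("TTT")
print(Q[ttt][CODONS.index("TTC")])  # synonymous transition: kappa * pi_j
print(Q[ttt][CODONS.index("CTT")])  # nonsynonymous transition: omega * kappa * pi_j
```

In practice the matrix is usually rescaled so that branch lengths are measured in expected substitutions per codon; that step is omitted here to keep the correspondence with Eq. 1 direct.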

Equation 1 provides the building block for most CSMs, yet it is 
unsuitable as a means to think about the substitution process at a 
site. For instance, the rate ratio in Eq. 1 is assumed to be the same 
for all nonsynonymous pairs of codons. If interpreted mechanisti- 
cally, this is tantamount to the assumption that the amino acid 
occupying a site has fitness $f$ and all other amino acids have fitness 
$f + df$, and that, with each substitution, the newly fixed amino acid 
changes its fitness to $f$ and the previous occupant changes its fitness 
to $f + df$. Such a narrow view of the substitution process, akin to 
frequency-dependent selection [6, 25], is conceptually misleading 
for the majority of proteins. To be clear, CSMs are undoubtedly a 
valuable tool to make inferences about the evolution of a protein 
(e.g., [8, 52, 71, 76]); our point is that they do not necessarily 
provide the best way to think about the process. 


² Single-nucleotide (SN) mutations that are nonsynonymous occur more frequently than those that are synony- 
mous due to idiosyncrasies in the genetic code. This is accounted for in the formulation of dN and dS, so that dN 
can be interpreted as the proportion of nonsynonymous SN mutations that are fixed. Likewise, dS is the 
proportion of synonymous SN mutations that are fixed. See Jones et al. [25] for a discussion of various 
interpretations of dN/dS. 


The way we think about the substitution process should not be 
limited to unrealistic assumptions used to formulate a tractable 
CSM. It is more informative to conceptualize evolution at a 
codon site using the traditional metaphor of a fitness landscape 
upon which greater height represents greater fitness as depicted in 
Fig. 1. If sites are assumed to evolve independently, a site-specific 
fitness landscape can be defined for the $h$th site by a vector of 
fitness coefficients $f^h$ and its implied vector of equilibrium codon 
frequencies $\pi^h$. Combined with a model for the mutation process, 
$\pi^h$ determines the evolutionary dynamics at the site, or the way it 
“moves” over its landscape (more formally, the way mutation and 
fixation events occur at a codon site in a population over time). This 
provides a way to think about evolution at a codon site in terms of 
three possible dynamic regimes: shifting balance, under which the 
site moves episodically away from the peak of its fitness landscape 
(i.e., the fittest amino acid) via drift and back again by positive 
selection (Fig. 1a); adaptive evolution, under which a change in 
the landscape is followed by movement of the site toward its new 
fitness peak (Fig. 1b); and neutral or nearly neutral evolution, 
under which drift dominates and the site is free to move over a 
relatively flat landscape limited primarily by biases in the mutation 
process. This way of thinking about the alignment-generating pro- 
cess is encapsulated by the MutSel framework [6, 7, 25]. The 
precise relationship between the MutSel framework and the three 
dynamic regimes will be presented in Case Study C. 


2.2 What Is the Objective of Model Building? 


CSMs have become increasingly complex with the addition of more 
free parameters since the introduction of the M-series models in 
Yang et al. [81]. The prima facie objective of this trend is to 
produce models that provide better mechanistic explanations of 
the data. The assumption is that this will lead to more accurate 
inferences about evolutionary processes, particularly as the volume 
of genetic data increases [35]. However, the significance of a new 
model parameter is assessed by a comparison of site-pattern distri- 
butions without reference to mechanism. Combined with the pos- 
sibility of confounding, this feature of the ML framework means 
that the objective of improving model fit does not necessarily 
coincide with the objective of providing a better representation of 
the mechanisms of the true generating process. 

Given any CSM with parameters $\theta_M$, it is possible to compute a 
vector $P$ that assigns a probability to each of the $61^N$ possible site 
patterns for an N-taxon alignment (i.e., a multinomial distribution 
Fig. 1 It can be useful to think of the substitution process at a site as movement 
on a site-specific fitness landscape. The horizontal axis in each figure shows the 
amino acids at a hypothetical site in order of their stationary frequencies 
indicated by the height of the bars. Frequency is a function of mutation and 
selection, but can be construed as a proxy for fitness. The site-specific dN/dS 
ratio [25] is a function of the amino acid that occupies the site, and can be <1 
(left of the red dashed line) or >1 (right of the red dashed line). (a) Suppose 
phenylalanine (F, TTT) is the fittest amino acid. The site-specific dN/dS ratio is 
much less than one when the site is occupied by F because any nonsynonymous 
mutation will always be to an amino acid that is less fit. Nevertheless, it is 
possible for an amino acid such as valine (V, GTT) to be fixed on occasion, 
provided that selection is not too stringent. When this happens, dN/dS at the site 
is temporarily elevated to a value greater than one as positive selection moves 
the site back to F by a series of replacement substitutions, e.g., V (GTT) → G 
(GGT) → C (TGT) → F (TTT). We call the episodic recurrence of this process 
shifting balance on a static fitness landscape. Shifting balance on a landscape 
for which all frequencies are approximately equal corresponds to nearly neutral 
evolution (not depicted), when dN/dS is always ~1. (b) Now, consider what 
happens following a change in one or more external factors that impact the 
functional significance of the site. The relative fitnesses of the amino acids might 
change from that depicted in a to that in b for instance, where glutamine (Q) is 
fittest. If at the time of the change the site is occupied by F (as is most likely), 
then dN/dS would be temporarily elevated as positive selection moves the site 
toward its new peak at Q, e.g., F (TTT) → Y (TAT) → H (CAT) → Q (CAA). This 
process of adaptive evolution is followed by a return to shifting balance once 
the site is occupied by Q 


for $61^N$ categories). We refer to $P = P_M(\theta_M)$ as the site-pattern 
distribution for that model. Figure 2 depicts the space of all possi- 
ble site-pattern distributions for an N-taxon alignment. Each ellipse 
represents the family of distributions $\{P_M(\theta_M) \mid \theta_M \in \Omega_M\}$, where 
$\Omega_M$ is the vector space of all possible values of $\theta_M$. For example, 
$\{P_{M0}(\theta_{M0}) \mid \theta_{M0} \in \Omega_{M0}\}$ is the family of distributions that can be 




Christopher T. Jones et al. 




Fig. 2 The $(61^N - 1)$-dimensional simplex containing all possible site-pattern 
distributions for an N-taxon alignment is depicted. The innermost ellipse repre- 
sents the subspace $\{P_{M0}(\theta_{M0}) \mid \theta_{M0} \in \Omega_{M0}\}$ that is the family of 
distributions that can be specified using M0, the simplest of CSMs. This is nested 
in the family of distributions that can be specified using M1 (blue ellipse), a 
hypothetical model that has the same parameters as M0 plus some extra 
parameters. Similarly, M1 is nested in M2 (red ellipse). Whereas models are 
represented by subspaces of distributions, the true generating process is 
represented by a single point $P_{GP}$, the location of which is unknown. The 
empirical site-pattern distribution $P_S(\hat{\theta}_S)$ corresponds to the saturated 
model fitted to the alignment; with large samples, $P_S(\hat{\theta}_S) \to P_{GP}$. For any 
other model M, the member $P_M(\hat{\theta}_M) \in \{P_M(\theta_M) \mid \theta_M \in \Omega_M\}$ most 
consistent with X is the one that minimizes deviance, which is twice the 
difference between the maximum log-likelihood of the data under the saturated 
model and the maximum log-likelihood of the data under M 


specified using M0, the simplest CSM that assumes a common 
substitution rate matrix $Q$ for all sites and branches. This is nested 
inside $\{P_{M1}(\theta_{M1}) \mid \theta_{M1} \in \Omega_{M1}\}$, where M1 is a hypothetical model 
that is the same as M0 but for a few extra parameters. Likewise, M1 
is nested in M2. The location of the site-pattern distribution for the 
true generating process is represented by $P_{GP}$. Its location is fixed 
but unknown. It is therefore not possible to assess the distance 
between it and any other distribution. Instead, comparisons are 
made using the site-pattern distribution inferred under the 
saturated model. 

Whereas a CSM $\{P_M(\theta_M) \mid \theta_M \in \Omega_M\}$ can be thought of as a 
family of multinomial distributions for the $61^N$ possible site pat- 
terns, the fitted saturated model $P_S(\hat{\theta}_S)$ is the unique distribution 
defined by the MLE $\hat{\theta}_S = (y_1/n, \ldots, y_m/n)^T$, where $y_i > 0$ is the 
observed frequency of the $i$th site pattern, $m$ is the number of 
unique site patterns, and $n$ is the number of codon sites. In other 
words, the fitted saturated model is the empirical site-pattern dis- 
tribution for a given alignment. Because it takes none of the 
mechanisms of mutation or selection into account, ignores the 
phylogenetic relationships between sequences, and excludes the 
possibility of site patterns that were not actually observed (i.e., 
$y_i/n = 0$ for site patterns $i$ not observed in X), $P_S(\hat{\theta}_S)$ can be 
construed as the maximally phenomenological explanation of the 
observed alignment. An alignment is always more likely under the 
saturated model than it is under any other CSM. $P_S(\hat{\theta}_S)$ therefore 
provides a natural benchmark for model improvement. 
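
The fitted saturated model is simple enough to compute directly from an alignment. The sketch below is an illustration under assumptions made here: the alignment is given as a list of equal-length, in-frame codon sequences, the function name `saturated_log_likelihood` is hypothetical, and the toy sequences are invented. The log-likelihood is the sum over observed patterns of y_i * log(y_i / n), i.e., the multinomial log-likelihood of the ordered sites evaluated at the empirical frequencies.

```python
import math
from collections import Counter


def saturated_log_likelihood(alignment):
    """Return (empirical site-pattern distribution, maximum log-likelihood).

    alignment: list of in-frame codon sequences of equal length.
    A site pattern is the tuple of codons found at one codon site.
    """
    n_sites = len(alignment[0]) // 3
    counts = Counter(
        tuple(seq[3 * s:3 * s + 3] for seq in alignment)
        for s in range(n_sites)
    )
    # MLE of the saturated model: observed relative frequencies y_i / n
    theta_hat = {pattern: y / n_sites for pattern, y in counts.items()}
    lnl = sum(y * math.log(y / n_sites) for y in counts.values())
    return theta_hat, lnl


# Toy alignment: 3 taxa, 4 codon sites (invented for illustration)
toy = ["ATGAAAATGCCC",
       "ATGAAGATGCCC",
       "ATGAAAATGCCA"]
theta_hat, lnl = saturated_log_likelihood(toy)
print(len(theta_hat), "unique site patterns; lnL =", round(lnl, 4))
```

Only patterns that were actually observed receive nonzero probability, which is why this model cannot be bettered in likelihood by any CSM fitted to the same alignment.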

For any alignment, the MLE over the family of distributions 
$\{P_M(\theta_M) \mid \theta_M \in \Omega_M\}$ is represented by a fixed point $P_M(\hat{\theta}_M)$ in 
Fig. 2. $P_M(\hat{\theta}_M)$ is the member of the family that minimizes the 
statistical deviance from $P_S(\hat{\theta}_S)$. Deviance is defined as twice 
the difference between the maximum log-likelihood (LL) of the 
data under the saturated model and the maximum log-likelihood of 
the data under M: 


$$D(\hat{\theta}_M, \hat{\theta}_S) = 2\{\ell(\hat{\theta}_S \mid X) - \ell(\hat{\theta}_M \mid X)\} \qquad (2)$$


A key feature of deviance is that it always decreases as more para- 
meters are added to the model, corresponding to an increase in the 
probability of the data under that model. For example, suppose 
$\{P_{M2}(\theta_{M2}) \mid \theta_{M2} \in \Omega_{M2}\}$ is the same family of distributions as 
$\{P_{M1}(\theta_{M1}) \mid \theta_{M1} \in \Omega_{M1}\}$ but for the inclusion of one additional 
parameter $\psi$, so that $\theta_{M2} = (\theta_{M1}, \psi)$. The improvement in the 
probability of the data under $P_{M2}(\hat{\theta}_{M2})$ over its probability under 
$P_{M1}(\hat{\theta}_{M1})$ is assessed by the size of the reduction in deviance 
induced by $\psi$: 


$$\Delta D(\hat{\theta}_{M1}, \hat{\theta}_{M2}) = D(\hat{\theta}_{M1}, \hat{\theta}_S) - D(\hat{\theta}_{M2}, \hat{\theta}_S) = 2\{\ell(\hat{\theta}_{M2} \mid X) - \ell(\hat{\theta}_{M1} \mid X)\} \qquad (3)$$


Equation 3 is just the familiar log-likelihood ratio (LLR) used to 
compare nested models under the maximum likelihood framework. 
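
The cancellation of the saturated-model term in Eq. 3 can be checked numerically. In the sketch below the three log-likelihood values are hypothetical and the function name `deviance` is illustrative; the point is only that the reduction in deviance between two nested models equals twice their log-likelihood difference, whatever the saturated log-likelihood happens to be.

```python
def deviance(lnl_saturated: float, lnl_model: float) -> float:
    """Deviance as in Eq. 2: twice the log-likelihood gap to the saturated model."""
    return 2.0 * (lnl_saturated - lnl_model)


# Hypothetical maximum log-likelihoods for one alignment
lnl_s = -1800.0    # saturated model
lnl_m1 = -2105.4   # nested model M1
lnl_m2 = -2101.9   # M1 plus one extra parameter

delta_d = deviance(lnl_s, lnl_m1) - deviance(lnl_s, lnl_m2)
llr = 2.0 * (lnl_m2 - lnl_m1)

print(f"reduction in deviance: {delta_d:.2f}")
print(f"log-likelihood ratio:  {llr:.2f}")
```

Because the saturated term drops out, the comparison says how much closer the richer model moves toward the empirical site-pattern distribution, not whether the added parameter corresponds to a process that actually occurred.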

Given this measure of model improvement, the de facto objec- 
tive of model building is not to provide a mechanistic explanation 
of the data that more accurately represents the true generating 
process, but only to move closer to the site-pattern distribution of 
the fitted saturated model. Real alignments are limited in size, so 
there will always be some distance between $P_S(\hat{\theta}_S)$ and $P_{GP}$ due to 
sampling error (as represented in Fig. 2). But even with an infinite 
number of codon sites, when $P_S(\hat{\theta}_S)$ converges to $P_{GP}$, the criterion 
of minimizing deviance does not inevitably lead to a better expla- 
nation of the data because of the possibility of confounding. Two 
processes are said to be confounded if they can produce similar 
patterns in the data. Hence, if $\psi$ represents a process E that did not 
actually occur when the data was generated, and if E is confounded 
with another process that did occur, the LLR in Eq. 3 can still be 
significant. Under this scenario, the addition of $\psi$ to M1 would 
engender movement toward $P_S(\hat{\theta}_S)$ and $P_{GP}$, but the new model 
M2 would also provide a worse mechanistic explanation of the data 
because it would falsely indicate that E occurred. The possibility of 
confounding and its impact on inference is demonstrated in Case 
Study D. 


3 Phase I: Pioneering CSMs 


The first effort to detect positive selection at the molecular level 
[24] relied on heuristic counting methods [43]. Phase I of CSM 
development followed with the introduction of formal statistical 
approaches based on ML [16, 42]. The first CSMs were used to 
infer whether the estimate $\hat{\omega}$ of a single nonsynonymous to synon- 
ymous substitution rate ratio averaged over all sites and branches 
was significantly greater than one. Such CSMs were found to have 
low power due to the pervasiveness of synonymous substitutions at 
most sites within a typical gene [76]. An early attempt to increase 
the statistical power to infer positive selection was the CSM 
designed to detect $\omega > 1$ on specific branches [78]. Models 
accounting for variations in $\omega$ across sites were subsequently devel- 
oped, the most prominent of which are the M-series models 
[78, 81]. These were accompanied by methods to identify individ- 
ual sites under positive selection. The quest for power culminated 
in the development of models that account for variations in the rate 
ratio across both sites and branches. The appearance of various 
branch-site models (e.g., [4, 10, 79, 86]) marks the end of Phase 
I of CSM development. 

Two case studies are employed in this section to illustrate some 
of the inferential challenges associated with Phase I models. We use 
Case Study A to examine the impact of low information content on 
the inference of positive selection at individual codon sites. The 
subject of this study is the M1a vs M2a model contrast applied 
to the tax gene of the human T-cell lymphotropic virus type I 
(HTLV-I; [63, 82]). We use Case Study B to illustrate how 
model misspecification (i.e., differences between the fitted model 
and the generating process) can lead to false inferences. The subject 
of this study is the Yang-Nielsen branch-site model (YN-BSM; 
[79]) applied to simulated data. 


3.1 Case Study A: Low Information Content 


To study the impact of low information content on inference, we 
use a pair of nested M-series models known as M1a and M2a
[70, 82]. Under M1a, sites are partitioned into two rate-ratio
categories, $0 < \omega_0 < 1$ and $\omega_1 = 1$, in proportions $p_0$ and
$p_1 = 1 - p_0$. M2a includes an additional category for the proportion of
sites $p_2 = 1 - p_0 - p_1$ that evolved under positive selection with
$\omega_2 > 1$. The use of multiple categories permits two levels of
inference. The first is an omnibus likelihood ratio test (LRT) for
evidence of positive selection somewhere in the gene, which is
conducted by contrasting a pair of nested models. For example,
the contrast of M1a vs M2a is made by computing the distance
LLR = $\Delta D(\hat{\theta}_{M1a}, \hat{\theta}_{M2a})$ between the two models and comparing
the result to the limiting distribution of the LLR under the null
model. In this case, the limiting distribution of LLR is often taken
to be $\chi^2_2$ [75], which would be correct under regular likelihood
theory because the models differ by two parameters. The second
level of inference is used to identify individual sites that underwent
positive selection. This is conducted only if positive selection is
inferred by the omnibus test (e.g., if LLR > 5.99 for the M1a vs
M2a contrast at the 5% level of significance). Let $c_0$, $c_1$, and $c_2$
represent the event that a given site pattern $x$ falls into the stringent
($0 < \omega_0 < 1$), neutral ($\omega_1 = 1$), or positive ($\omega_2 > 1$) selection
category, respectively. Applying Bayes' rule:

$$
\Pr(c_2 \mid x, \hat{\theta}_{M2a}) = \frac{\Pr(x \mid c_2, \hat{\theta}_{M2a})\,\hat{p}_2}{\sum_{k=0}^{2} \Pr(x \mid c_k, \hat{\theta}_{M2a})\,\hat{p}_k} \tag{4}
$$


Sites with a sufficiently high posterior probability (e.g.,
$\Pr(c_2 \mid x, \hat{\theta}_{M2a}) > 0.95$) are inferred to have undergone positive
selection. Equation 4 is representative of the naive empirical Bayes
(NEB) approach, under which MLEs ($\hat{\theta}_{M2a}$) are used to compute
posterior probabilities.
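
The NEB calculation in Eq. 4 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in any CSM software: the function name `neb_posteriors` and the numerical inputs are hypothetical, and the per-category site likelihoods are assumed to have been computed elsewhere (e.g., by a pruning algorithm on the tree).

```python
def neb_posteriors(site_likelihoods, proportions):
    """Posterior probability of each category for one site pattern (Bayes' rule).

    site_likelihoods[k] plays the role of Pr(x | c_k, theta_hat) and
    proportions[k] the role of the estimated mixture weight p_k. Both are
    treated as known values, which is the defining feature of NEB.
    """
    if len(site_likelihoods) != len(proportions):
        raise ValueError("one likelihood is needed per category")
    weighted = [lik * p for lik, p in zip(site_likelihoods, proportions)]
    total = sum(weighted)
    if total <= 0.0:
        raise ValueError("site pattern has zero probability under the model")
    return [w / total for w in weighted]


# Hypothetical values for a single site pattern (not taken from a real fit).
likelihoods = [1e-6, 4e-6, 9e-6]   # Pr(x | c_0), Pr(x | c_1), Pr(x | c_2)
proportions = [0.6, 0.3, 0.1]      # p_0, p_1, p_2
posterior = neb_posteriors(likelihoods, proportions)

# If all mixture weight is placed on the last category, the posterior for
# that category is 1 no matter what the site likelihoods are.
degenerate = neb_posteriors(likelihoods, [0.0, 0.0, 1.0])
```

Because the estimated proportions enter the formula as fixed constants, any error in them passes directly into the posterior probabilities.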

The NEB approach ignores potential errors in parameter esti- 
mates that can lead to false inference of positive selection at a site 
(i.e., a false positive). The resulting false positive rate can be espe- 
cially high for alignments with low information content. An exam- 
ple setting with low information content arises when there are a 
substantial number of invariant sites, since these provide little 
information about the substitution process. The issue of low infor- 
mation content is well illustrated by the extreme case of the tax 
gene, HTLV-I [63]. The alignment consists of 20 sequences with 
181 codon sites, 158 of which are invariant. The 23 variable sites 
have only one substitution each: 2 are synonymous and 21 are 
nonsynonymous. The high ratio of nonsynonymous-to-synony- 
mous substitutions suggests that the gene underwent positive 
selection. This hypothesis was supported by analytic results: the
LLR for the M1a vs M2a contrast was 6.96, corresponding to a
p-value of approximately 0.03 [82]. The omnibus test therefore
supported the conclusion that the gene underwent positive selection.
However, the MLE for $p_2$ under M2a was $\hat{p}_2 = 1$. Using this
value in Eq. 4 gives $\Pr(c_2 \mid x, \hat{\theta}_{M2a}) = 1$ for all sites, including the
158 invariant sites. Such an unreasonable result can occur under
NEB because, despite the possibility of large sampling errors in
MLEs due to low information, $\hat{\theta}_{M2a}$ is treated as a known value in
Eq. 4.
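
The reported p-value can be checked by hand. For a chi-squared distribution with two degrees of freedom the upper-tail probability has the closed form exp(-x/2), so no statistical library is needed. The sketch below assumes the chi-squared reference distribution with 2 df that is often used for this contrast; the helper name `chi2_sf_2df` is ours.

```python
import math


def chi2_sf_2df(x):
    """Upper-tail probability of a chi-squared variable with 2 degrees of freedom.

    The closed form exp(-x / 2) holds only for 2 df.
    """
    return math.exp(-x / 2.0)


llr = 6.96                               # M1a vs M2a contrast for the tax gene [82]
p_value = chi2_sf_2df(llr)               # roughly 0.031
critical_5pct = -2.0 * math.log(0.05)    # roughly 5.99, the 5% critical value
reject_null = llr > critical_5pct
```

The result agrees with the p-value of approximately 0.03 and the 5.99 threshold quoted in the text.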

Bayes empirical Bayes (BEB; [82]), a partial Bayesian approach 
under which rate ratios and their corresponding proportions are 
assigned discrete prior distributions (cf. [21]), was proposed as an 
alternative to NEB. Numerical integration over the assumed priors 
tends to provide better estimates of posterior probabilities, partic- 
ularly in cases where information content is low. Using BEB in the 
analysis of the tax gene, for example, the posterior probability was
$0.91 < \Pr(c_2 \mid x, \hat{\theta}_{M2a}) < 0.93$ for the 21 sites with a single
nonsynonymous change and $0.55 < \Pr(c_2 \mid x, \hat{\theta}_{M2a}) < 0.61$ for the
remaining sites [82]. Hence, the BEB approach mitigated the
problem of low information content, as the posterior probability
of positive selection at invariant sites was reduced. An alternative to
BEB is called smoothed bootstrap aggregation (SBA) [38]. SBA
entails drawing site patterns from $X$ with replacement (i.e., bootstrap)
to generate a set of alignments $\{X_1, \ldots, X_m\}$ with similar
information content as $X$. The MLEs $\hat{\theta}_1, \ldots, \hat{\theta}_m$ for the vector of
model parameters $\theta$ are then estimated by fitting the CSM to each
$X_i \in \{X_1, \ldots, X_m\}$. A kernel smoother is applied to these values to
reduce sampling errors. The mean value of the resulting smoothed
$\{\hat{\theta}_i\}_{i=1}^{m}$ is then used in Eq. 4 in place of the MLE for $\theta$ obtained
from the original alignment to estimate posterior probabilities. This
approach was shown to balance power and accuracy at least as well
as BEB. However, SBA has the advantage that it can accommodate the
uncertainty of all parameter estimates (not just those of the $\omega$
distribution, as in BEB) and is much easier to implement. When
SBA was applied to the tax gene, the posterior probabilities for positive
selection were further reduced: $0.87 < \Pr(c_2 \mid x, \hat{\theta}_{M2a}) < 0.89$
for the 21 sites with a single nonsynonymous change, and
$0.55 < \Pr(c_2 \mid x, \hat{\theta}_{M2a}) < 0.60$ for the remaining sites [38].
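
The resampling step shared by bootstrap-based methods such as SBA can be sketched as follows. This is only an illustration under strong simplifications: `fraction_variable` is a toy stand-in for fitting a CSM by ML, the alignment columns are invented (only the counts of 158 invariant and 23 variable sites echo the tax gene), and the kernel smoothing step of SBA is omitted.

```python
import random
import statistics


def bootstrap_estimates(site_patterns, fit, m=200, seed=0):
    """Resample site patterns with replacement m times and refit each replicate.

    `fit` is any function mapping a list of site patterns to an estimate.
    """
    rng = random.Random(seed)
    n = len(site_patterns)
    estimates = []
    for _ in range(m):
        resampled = [site_patterns[rng.randrange(n)] for _ in range(n)]
        estimates.append(fit(resampled))
    return estimates


def fraction_variable(patterns):
    """Toy estimator (hypothetical): fraction of sites that are not invariant."""
    return sum(len(set(col)) > 1 for col in patterns) / len(patterns)


# Invented columns: 158 invariant and 23 variable sites.
alignment = ["AAAA"] * 158 + ["AAAG"] * 23
estimates = bootstrap_estimates(alignment, fraction_variable, m=200, seed=1)
mean_estimate = statistics.mean(estimates)
spread = statistics.stdev(estimates)
```

With a real CSM in place of the toy estimator, the list `estimates` is the sample of bootstrap MLEs that a method like SBA would go on to smooth and average.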

The problem of low information content was fairly obvious in 
the case of the tax gene, as 158 of the 181 codon sites within that 
dataset were invariant. However, it can sometimes be unclear 
whether there is enough variation in an alignment to ensure reliable 
inferences. It would be useful to have a method to determine 
whether a given data set might be problematic. An MLE $\hat{\theta}$ will
always converge to a normal distribution centered at the true
parameter value $\theta$ with variance proportional to $1/n$ as the sample
size $n$ (a proxy for information content) gets larger, provided that
the CSM satisfies certain “regularity” conditions (a set of technical
conditions that must hold to guarantee that MLEs will converge in
distribution to a normal, and that the LLR for any pair of nested
models will converge to its expected chi-squared distribution). This
expectation makes it possible to assess whether an alignment is
sufficiently informative to obtain the benefits of regularity. The
first step is to generate a set of bootstrap alignments $\{X_1, \ldots, X_m\}$.
The CSM can then be fitted to these to produce a sample distribution
$\{\hat{\theta}_i\}_{i=1}^{m}$ for the MLE of any model parameter $\theta$. If the alignment
is sufficiently informative with respect to $\theta$, then a histogram of
$\{\hat{\theta}_i\}_{i=1}^{m}$ should be approximately normal in distribution. Serious
departures from normality (e.g., a bimodal distribution) indicate
unstable MLEs, which are a sign of insufficient information or an
irregular modeling scenario. Mingrone et al. [38] recommend
using this technique with real data as a means of gaining insight
into potential difficulties of parameter estimation using a
given CSM.


3.1.1 Irregularity and Penalized Likelihood

Issues associated with low information content can be made worse
by violations of certain regularity conditions. For example, M2a is
the same as M1a but for two extra parameters, $p_2$ and $\omega_2$. Usual
likelihood theory would therefore predict that the limiting distribution
of the LLR is $\chi^2_2$. However, this result is valid only if the
regularity conditions hold. Among these conditions is that the null
model is not obtained by placing parameters of the alternate model
on the boundary of parameter space. Since M1a is the same as M2a
but with $p_2 = 0$, this condition is violated. The same can be said for
many nested pairs of Phase I CSMs, such as M7 vs M8 [81] or M1
vs branch-site Model A [79]. Although the theoretical limiting
distribution of the LLR under some irregular conditions has been
determined by Self and Liang [54], those results do not include
cases where one of the model parameters is unidentifiable under the
null [2]. Since M1a is M2a with $p_2 = 0$, the likelihood under M1a
is the same for any value of $\omega_2$. This makes $\omega_2$ unidentifiable under
the null. The limiting distribution for the M1a vs M2a contrast is
therefore unknown [74].

A penalized likelihood ratio test (PLRT; [39]) has been proposed
to mitigate problems associated with unidentifiable parameters.
Under this method, the likelihood function for the
alternate model (e.g., M2a) is modified so that values of $p_2$ closer
to zero are penalized. This has the effect of drawing the MLE for $p_2$
away from the boundary, and can be interpreted as a way to “regularize”
the model. PLRT seems to be particularly useful in cases where
the analysis of a real alignment produces a small value of $\hat{p}_2$
accompanied by an unrealistically large value of $\hat{\omega}_2$. This can happen
because $\hat{\omega}_2$ is influenced by fewer and fewer site patterns as $\hat{p}_2$
approaches zero, and is therefore subject to larger and larger sampling
errors. In addition, $\hat{\omega}_2$ and $\hat{p}_2$ tend to be negatively correlated,
which further contributes to the large sampling errors. For example,
Mingrone et al. [39] found that M2a fitted to a 5-taxon
alignment with 198 codon sites without penalization gave
$(\hat{p}_2, \hat{\omega}_2) = (0.01, 34.70)$. These MLEs, if taken at face value,
suggest that a small number of sites in the gene underwent positive
selection. However, such a large rate ratio is difficult to believe
given that its estimate is consistent with only approximately
2 codon sites (i.e., an estimated 1% of the 198 sites).
Using the PLRT, the MLEs were $(\hat{p}_2, \hat{\omega}_2) = (0.09, 1.00)$. These
suggest that selection pressure was nearly neutral at a significant
proportion of sites in the gene. In this case, the rate ratio is
consistent with 9% of the 198 sites, or about 18 sites, and is therefore
less likely to be an artifact of sampling error. We expect this
approach to be useful in a wide variety of evolutionary applications
that rely on mixture models to make inferences (e.g., [13, 34,
47, 66]).

Other approaches for dealing with low information content in
the data for an individual gene include the empirical Bayes approach
of Kosiol et al. [33] and the parametric bootstrapping methods of
Gibbs [14]. Both methods exploit the additional information content
available from other genes. Kosiol et al. [33] adopted an
empirical Bayes approach, where $\omega$ values varied over edges and
genes according to a distribution. Because empirical posterior
distributions are used, the approach is more akin to detecting sites
under positive selection (e.g., using NEB) than formal testing. By
contrast, Gibbs [14] adopted a test-based approach and utilized
parametric bootstrapping [15] to approximate the distribution of
the likelihood ratio statistic using data from other genes to obtain
parameter sets to use in the bootstrap. Although this approach can
attenuate issues associated with low information content, it can also
be computationally expensive, especially when applied to large
alignments.


3.2 Case Study B: Model Misspecification

The mechanisms that give rise to the diversity of site patterns in a 
set of homologous genes are highly complex and not fully under- 
stood. CSMs are therefore necessarily simplified representations of 
the true generating process, and are in this sense misspecified. The 
extent to which misspecification might cause an omnibus LRT to 
falsely detect positive selection was of primary concern during 
Phase I of model development. We use a particular form of the 
YN-BSM called Model A [79] to illustrate this issue. In its original 
form, the omnibus LRT assumes a null under which a proportion
$p_0$ of sites evolved under stringent selection with $\omega_0 = 0$ and the
remaining sites evolved under a neutral regime with $\omega_1 = 1$ on all
branches of the tree (i.e., model M1 in [44]). This is contrasted
with Model A, which is the same as M1 except that it assumes that
some stringent sites and some neutral sites evolved under positive
selection with $\omega_2 > 1$ on a prespecified branch called the foreground
branch. The omnibus test contrasting M1 with Model A
was therefore designed to detect a subset of sites that evolved 
adaptively on the same branch of the tree. 

Table 1
Rate ratios ($\omega$) for regimes X and Z taken from Zhang [85]

Sites              1-20  21-40  41-60  61-80  81-100  101-120  121-140  141-160  161-180  181-200
$\omega$ regime X  1.00  1.00   0.80   0.80   0.50    0.50     0.20     0.20     0.00     0.00
$\omega$ regime Z  1.00  1.00   1.00   1.00   1.00    1.00     1.00     1.00     1.00     1.00

During this period of model development, the standard
method to test the impact of misspecification on the reliability of
an omnibus LRT was to generate alignments in silico using a more
complex version of the CSM to be tested as the generating model. 
This usually involved adding more variability in $\omega$ across sites
and/or branches than assumed by the fitted CSM while leaving
all other aspects of the generating model the same. In Zhang [85],
for example, alignments were generated using site-specific rate
matrices, as in Eq. 1, with rate ratios $\omega$ specified by predetermined
selection regimes, two of which are shown in Table 1. In one 
simulation, 200 alignments were generated using regime Z on a 
single foreground branch and regime X on all of the remaining 
branches of a 10 or 16 taxon tree. The gene therefore underwent a 
mixture of stringent selection and neutral evolution over most of 
the tree (regime X), but with complete relaxation of selection 
pressure on the foreground branch (regime Z). Positive selection 
did not occur at any of the sites. Nevertheless, the M1 vs Model A 
contrast inferred positive selection in 20-55% of the alignments, 
depending on the location of the foreground branch. Such a high 
rate of false positives was attributed to the mismatch between the 
process used to generate the data compared to the process assumed 
by the null model M1 [85]. 
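
The simulation design described above can be written down compactly. The sketch below only encodes the two regimes from Table 1 and the rule that the foreground branch uses regime Z while all other branches use regime X. The names `expand`, `REGIME_X`, `REGIME_Z`, and `site_omega` are ours, and no sequence simulation is performed.

```python
def expand(block_values, block_size=20):
    """Repeat each block value for the 20 consecutive sites it covers."""
    return [w for w in block_values for _ in range(block_size)]


# Rate ratios for sites 1-200, in blocks of 20 sites (Table 1).
REGIME_X = expand([1.00, 1.00, 0.80, 0.80, 0.50, 0.50, 0.20, 0.20, 0.00, 0.00])
REGIME_Z = expand([1.00] * 10)


def site_omega(site, branch_is_foreground):
    """Rate ratio at a 1-based site: regime Z on the foreground, X elsewhere."""
    regime = REGIME_Z if branch_is_foreground else REGIME_X
    return regime[site - 1]


# No site on any branch has a rate ratio above one, so any inference of
# positive selection from data simulated this way is a false positive.
max_omega = max(REGIME_X + REGIME_Z)
```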

The branch-site model was subsequently modified to allow
$0 < \omega_0 < 1$ instead of $\omega_0 = 0$ (Modified Model A in [86]). Furthermore,
the new null model is specified under the assumption
that some proportion $p_0$ of sites (the stringent sites) evolved under
stringent selection with $0 < \omega_0 < 1$ everywhere in the tree except
on the foreground branch, where those same sites evolved neutrally
with $\omega_2 = 1$. All other sites in the alignment (the neutral sites) are
assumed to have evolved neutrally with $\omega_1 = 1$ everywhere in the
tree. This is contrasted with the Modified Model A, which assumes
that some of the stringent sites and some of the neutral sites evolved
under positive selection with $\omega_2 > 1$ on the foreground. Hence,
unlike the original omnibus test that contrasts M1 with Model A,
the new test contrasts Modified Model A with $\omega_2 = 1$ against
Modified Model A with $\omega_2 > 1$. These changes to the YN-BSM
were shown to mitigate the problem of false inference. For exam- 
ple, using the same generating model with regimes X and Z, the 
modified omnibus test falsely inferred positive selection in only 
1-7.5% of the alignments, consistent with the 5% level of significance
of the test [86].

This case study demonstrates how problems associated with
model misspecification were traditionally identified, and how they 
could be completely corrected through relatively minor changes to 
the model. However, the generating methods employed by studies 
such as Zhang [85] and Zhang et al. [86], although sophisticated 
for their time, produced alignments that were highly unrealistic 
compared to real data. For example, it was recently shown that a 
substantial proportion of variation in many real alignments might 
be due to selection effects associated with shifting balance over 
static site-specific fitness landscapes [25, 26]. This process results 
in random changes in site-specific rate ratios, or heterotachy, that 
cannot be replicated using traditional CSMs as the generating 
model. While the mitigation of statistical pathologies due to low 
information content (e.g., using BEB or SBA) or model misspeci- 
fication (e.g., by altering the null and alternative hypotheses or the 
use of penalized likelihood) were critical advancements during 
Phase I of CSM development, other statistical pathologies went 
unrecognized due to reliance on unrealistic simulation methods. 
This issue is taken up in the next section. 


4 Phase II: Advanced CSMs


A typical protein-coding gene evolves adaptively only episodically 
[59]. The evidence of adaptive evolution of this type can be very 
difficult to detect. For example, it is assumed under the YN-BSM 
that a random subset of sites switched from a stringent or neutral 
selection regime to positive selection together on the same set of 
foreground branches. The power to detect a signal of this kind can 
be very low when the proportion of sites that switched together is 
small [77]. Perhaps encouraged by the reliability of Phase I models 
demonstrated by extensive simulation studies [2, 3, 29, 31, 37, 70, 
77, 82, 85, 86], combined with experimental validation of results 
obtained from their application to real data [1, 71, 76], investiga- 
tors began to formulate increasingly complex and parameter-rich 
CSMs [31, 41, 48, 50, 55, 64, 65]. The hope was that carefully 
selected increases in model complexity would yield greater power 
to detect subtle signatures of positive selection overlooked by Phase 
I models. The introduction of such CSMs marks the beginning of 
Phase II of their historical development. 
Phase II models fall into three broad categories: 


1. The first consists of Phase I CSMs modified to account for 
more variability in selection effects across sites and branches 
than previously assumed, with the aim of increasing the power 
to detect subtle signatures of positive selection (e.g., the 
branch-site random effects likelihood model, BSREL; [31]).

2. The second category includes Phase I CSMs modified to contain
parameters for mechanistic processes not directly associated
with selection effects. Many such models have been
motivated by a particular interest in the added mechanism 
(e.g., the fixation of double and triple mutations; [26, 40, 
83]), or by the notion that increasing the mechanistic content 
of a CSM can only improve inferences about selection effects 
(e.g., by accounting for variations in the synonymous substitu- 
tion rate; [30, 51]). 

3. The third category of models abandons the traditional formu- 
lation of Eq. 1 in favor of a substitution process expressed in 
terms of explicit population genetic parameters, such as popu- 
lation size and selection coefficients [45, 48-50, 64, 65]. 


An example of the first category of models is BSREL, which 
accounts for variations in selection effects across sites and over 
branches by assuming a different rate-ratio distribution
$\{(\omega_i^b, p_i^b) : i = 1, \ldots, k_b\}$ for each branch $b$ of a tree [31]. BSREL
was later found to be more complex than necessary, so an adaptive
version was formulated to allow the number of components $k_b$ on a
given branch to adjust to the apparent complexity of selection 
effects on that branch (aBSREL; [55]). A further reduction in 
model complexity led to the formulation of the test known as 
BUSTED (for branch-site unrestricted statistical test for episodic 
diversification; [41]), which we use to illustrate the problem of 
confounding in Case Study C. An example of the second category 
of models is the addition of parameters for the rate of double and 
triple mutations to traditional CSMs, the most sophisticated ver- 
sion of which is RaMoSSwDT (for Random Mixture of Static and 
Switching sites with fixation of Double and Triple mutations; [26]). 
This model is used in Case Study D to illustrate the problem of 
phenomenological load. 

Models in the third category are the most ambitious CSMs 
currently in use, and are far more challenging to fit to real align- 
ments than traditional models. One of the most impressive exam- 
ples of their application is the site-wise mutation-selection model 
(swMutSel; [64, 65]) fitted to a concatenated alignment of 12 mitochondrial
genes (3598 codon sites) from 244 mammalian species.
Based on the mutation-selection framework of Halpern and Bruno
[19], swMutSel estimates a vector of selection coefficients for each
site in an alignment. This and similar models (e.g., [48-50]) appear
to be reliable [58], but require a very large number of taxa (e.g.,
hundreds). Phase II models of this category are therefore impractical
for the majority of empirical datasets. Here, we utilize MutSel as
an effective means to generate realistic alignments with plausible
levels of variation in selection effects across sites and over time
rather than as a tool of inference.


4.1 Case Study C: Confounding

By expressing the codon substitution process in terms of explicit 
population genetic parameters, the MutSel framework facilitates 
the investigation of complex evolutionary dynamics, such as shift- 
ing balance on a fixed fitness landscape or adaptation to a change in 
selective constraints (i.e., a peak shift; [6, 25]) that are missing from
alignments generated using traditional methods. Specifically, by
assigning a different vector of fitness coefficients for the 20 amino
acids to each site, MutSel can generate more variation in rate ratio
across sites and over time than has been realized in past simulation
studies (e.g., Table 1). In this way, MutSel provides the basis of
a generating model that can be adjusted to produce alignments that 
closely mimic real data [26]. MutSel therefore serves to connect 
demonstrably plausible evolutionary dynamics to the pathology we 
refer to as confounding. 

Under MutSel, the dynamic regime at the $h$th codon site (e.g.,
shifting balance, neutral, nearly neutral, or adaptive evolution) is
uniquely specified by a vector of fitness coefficients
$f^h = (f_1^h, \ldots, f_m^h)$. It is generally assumed that mutation to any of
the three stop codons is lethal, so $m = 61$ for nuclear genes and
$m = 60$ for mitochondrial genes. And, although it is not a requirement,
it is typical to assume that the $f_j^h$ are constant across synonymous
codons [25, 57]. Given $f^h$, the elements of a site-specific
instantaneous rate matrix $A^h$ can be defined as follows for all $i \neq j$
(cf. Eq. 1):

$$
A_{ij}^h \propto
\begin{cases}
\mu_{ij} & \text{if } s_{ij}^h = 0 \\
\mu_{ij}\,\dfrac{s_{ij}^h}{1 - \exp(-s_{ij}^h)} & \text{otherwise}
\end{cases}
\tag{5}
$$

where $\mu_{ij}$ is the rate at which codon $i$ mutates to codon $j$ and
$s_{ij}^h = 2N_e(f_j^h - f_i^h)$ is the scaled selection coefficient for a
population of haploids with effective population size $N_e$. The probability
that the new mutant $j$ is fixed is approximated by
$s_{ij}^h / \{1 - \exp(-s_{ij}^h)\}$ [9, 28].
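
A single off-diagonal element of the rate matrix in Eq. 5 can be computed as sketched below. The function and argument names (`fixation_factor`, `mutsel_rate`, `mu_ij`, `n_e`) and the numerical values are assumptions made for illustration, and the proportionality constant in Eq. 5 is taken to be one.

```python
import math


def fixation_factor(s):
    """Selection term s / (1 - exp(-s)) from Eq. 5, which tends to 1 as s -> 0."""
    if abs(s) < 1e-8:
        return 1.0 + s / 2.0          # first-order Taylor expansion near zero
    return s / -math.expm1(-s)        # -expm1(-s) equals 1 - exp(-s)


def mutsel_rate(mu_ij, f_i, f_j, n_e):
    """Unnormalized rate from codon i to codon j at one site."""
    s_ij = 2.0 * n_e * (f_j - f_i)    # scaled selection coefficient (haploid)
    return mu_ij * fixation_factor(s_ij)


# Hypothetical values: equal mutation rates, fitness difference of 0.001.
neutral = mutsel_rate(1.0, 0.0, 0.0, 1000)     # equal fitness, rate = mu_ij
uphill = mutsel_rate(1.0, 0.0, 0.001, 1000)    # toward the fitter codon
downhill = mutsel_rate(1.0, 0.001, 0.0, 1000)  # away from the fitter codon
```

With equal mutation rates in both directions, the ratio `uphill / downhill` equals exp(s), which is why a site spends most of its time at codons with high fitness.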

Fig. 3 Fitness coefficients for the 20 amino acids were drawn from a normal
distribution centered at zero and with standard deviation $\sigma = 0.001$. Bars show
the resulting stationary frequencies (a proxy for fitness) sorted from largest to
smallest. They compose a metaphorical site-specific landscape over which the
site is imagined to move. The solid red line shows the codon-specific rate ratio
$dN_i^h/dS_i^h$ for the sorted codons. This varies depending on the codon currently
occupying the site, and can be greater than one following a chance substitution
into the tail (to the right) of the landscape. In this case, the codon-specific rate
ratio for the site ranged from 0.21 to 4.94 with a temporally averaged
site-specific rate ratio of $dN^h/dS^h = 0.52$

The rate matrix $A^h$ defines the dynamic regime for the site as
illustrated in Fig. 3. The bar plot shows codon frequencies
$\pi^h = (\pi_1^h, \ldots, \pi_m^h)$ sorted in descending order. A site spends most of
its time occupied by codons to the left or near the “peak” of its
landscape. The codon-specific rate ratio for the site ($dN_i^h/dS_i^h$ for
codon $i$) is low near the peak (red line plot in Fig. 3) since mutations
away from the peak are seldom fixed. However, if selection is
not too stringent, the site will occasionally drift to the right into the
“tail” of its landscape. When this occurs, the codon-specific rate
ratio will be elevated for a time until a combination of drift and
positive selection moves the site back to its peak. This dynamic 
between selection and drift is reminiscent of Wright’s shifting 
balance. It implies that, when a population is evolving on a fixed 
fitness landscape (i.e., with no adaptive evolution), its gene 
sequences can nevertheless contain signatures of temporal changes 
in site-specific rate ratios (heterotachy), and that these might 
include evidence of transient elevation to values greater than one 
(i.e., positive selection). Such signatures of positive selection due to 
shifting balance can be detected by Phase II CSMs [25]. 

For example, BUSTED [41] was developed as an omnibus test 
for episodic adaptive evolution. The underlying CSM was formu- 
lated to account for variations in the intensity of selection over both 
sites and time modeled as a random effect. This is in contrast to the 
YN-BSM, which treats temporal changes in rate ratio as a fixed 
effect that occurs on a prespecified foreground branch (although 
the sites under positive selection are still a random effect). We 
therefore refer to the CSM underlying BUSTED as the random 
effects branch-site model (RE-BSM) to serve as a reminder of this 
important distinction. Under RE-BSM, the rate ratio at each site 
and branch combination is assumed to be an independent draw 
from the distribution $\{(\omega_0, p_0), (\omega_1, p_1), (\omega_2, p_2)\}$. In this way,
the model accounts for variations in selection effects both across
sites and over time. BUSTED contrasts the null hypothesis that
$\omega_0 \le \omega_1 \le \omega_2 = 1$ with the alternative that $\omega_0 \le \omega_1 \le 1 \le \omega_2$.
When applied to real data, rejection of the null is interpreted as
evidence of episodic adaptive evolution.

Unlike the YN-BSM that aims to detect a subset of sites that
underwent adaptive evolution together on the same foreground
branches (i.e., coherently), BUSTED was designed to detect heterotachy
similar to the type predicted by the mutation-selection
framework: shifting balance on a static fitness landscape. Jones et al.
[25] recently demonstrated that plausible levels of shifting balance
can produce signatures of episodic positive selection that can be
detected. BUSTED inferred episodic positive selection in as many
as 40% of alignments generated using the MutSel framework. Significantly,
BUSTED was correct to identify episodic positive selection
in these trials. Even though the generating process assumed
fixed site-specific landscapes (so there was no episodic adaptive
evolution), and the long-run average rate ratio at each site was
necessarily less than one [57], positive selection nevertheless did
sometimes occur by shifting balance. This illustrates the general
problem of confounding. Two processes are said to be confounded
if they can produce the same or similar patterns in the data. In this
case, episodic adaptive evolution (i.e., the evolutionary response to
changes in site-specific landscapes) and shifting balance (i.e., evolution
on a static fitness landscape) are confounded because they can
both produce rate-ratio distributions that indicate episodic positive
selection. The possibility of confounding underlines the fact that
there are limitations in what can be inferred about evolutionary
processes based on an alignment alone.


4.2 Case Study D: Phenomenological Load

Phenomenological load (PL) is a statistical pathology related to 
both model misspecification (Case Study B) and confounding 
(Case Study C) that was not recognized during Phase I of CSM 
development. When a model parameter that represents a process 
that played no role in the generation of an alignment (i.e., a mis- 
specified process) nevertheless absorbs a significant amount of vari- 
ation, its MLE is said to carry PL [26]. This is more likely to occur 
when the misspecified process is confounded with one or more 
other processes that did play a role in the generation of the data, 
and when a substantial proportion of the total variation in the data 
is unaccommodated by the null model [26]. PL increases the 
probability that a hypothesis test designed to detect the misspeci- 
fied process will be statistically significant (as indicated by a large 
LLR) and can therefore lead to the incorrect conclusion that the 
misspecified process occurred. Critically, Jones et al. [26] showed 
that PL was only detected when model contrasts were fitted to data 
generated with realistic evolutionary dynamics using the MutSel 
model framework. 

To illustrate the impact of PL, we consider the case of CSMs 
modified to detect the fixation of codons following simultaneous 
double and triple (DT) nucleotide mutations. The majority of
CSMs currently in use assume that codons evolve by a series of
single-nucleotide substitutions, with the probability for DT 
changes set to zero. However, recent model-based analyses have 
uncovered evidence for DT mutations [32, 68, 83]. Early estimates 
of the percentage of fixed mutations that are DT were perhaps 
unrealistically high. Kosiol et al. [32], for example, estimated a 
value close to 25% in an analysis of over 7000 protein families 
from the Pandit database [69]. Alternatively, when estimates were 
derived from a more realistic site-wise mutation-selection model, 
DT changes comprised less than 1% of all fixed mutations 
[64]. More recent studies suggest modest rates of between 1% 
and 3% [5, 20, 27, 53]. Whatever the true rate, several authors 
have argued that it would be beneficial to introduce a few extra 
parameters into a standard CSM to account for DT mutations (e.g., 
[40, 83]). The problem with this suggestion is that episodic fixation 
of DT mutations can produce signatures of heterotachy consistent 
with shifting balance. 

Recall the comparison of M1, a CSM containing parameters
represented by the vector $\theta_1$, and M2, the same model but for
the inclusion of one additional parameter $\psi$, so that
$\theta_2 = (\theta_1, \psi)$. The parameter $\psi$ will reduce the
deviance of M2 compared to M1 by some proportion of the baseline
deviance between the simplest CSM (M0) and the saturated model
$P_S(\hat{\theta}_S)$. We call this the percent reduction in deviance
(PRD) attributed to $\hat{\psi}$:

$$\mathrm{PRD}(\hat{\psi}) = \frac{\Delta D(\hat{\theta}_{M1}, \hat{\theta}_{M2})}{\Delta D(\hat{\theta}_{M0}, \hat{\theta}_{S})} \tag{6}$$

Suppose M1 and M2 were fitted to an alignment and that the
LLR $= \Delta D(\hat{\theta}_{M1}, \hat{\theta}_{M2})$ was found to be
statistically significant. This would lead an analyst to attribute
the PRD($\hat{\psi}$) to real signal for the process $\psi$ was meant
to represent, possibly combined with some PL and noise. Now, consider
the case in which the process represented by $\psi$ did not actually
occur (i.e., it was not a component of the true generating process).
Under this scenario, PRD($\hat{\psi}$) would contain no signal, but
would be entirely due to PL plus noise. When this is known to be the
case, we set PRD($\hat{\psi}$) = PL($\hat{\psi}$). As illustrated
below, PL($\hat{\psi}$) can be large enough to result in rejection of
the null, and therefore lead to a false conclusion about the data
generating process.
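Because a difference in deviance between two nested models equals twice the difference in their maximized log-likelihoods, Eq. 6 can be evaluated from four log-likelihood values. The following Python sketch is illustrative only: the function names are hypothetical, the saturated log-likelihood is computed from site-pattern counts under the usual multinomial assumption, and the result is returned as a ratio (multiply by 100 to express it as a percentage).

```python
import math
from collections import Counter


def saturated_log_likelihood(site_patterns):
    """Multinomial log-likelihood of the saturated model: sum_i n_i * ln(n_i / N).

    `site_patterns` is any iterable of hashable site patterns (one per codon site).
    """
    counts = Counter(site_patterns)
    total = sum(counts.values())
    return sum(n * math.log(n / total) for n in counts.values())


def delta_deviance(lnl_simple, lnl_complex):
    """Reduction in deviance when moving from the simpler to the richer model."""
    return 2.0 * (lnl_complex - lnl_simple)


def prd(lnl_m0, lnl_m1, lnl_m2, lnl_saturated):
    """Eq. 6: deviance reduction from M1 to M2 relative to the M0-to-saturated baseline."""
    return delta_deviance(lnl_m1, lnl_m2) / delta_deviance(lnl_m0, lnl_saturated)
```

With hypothetical log-likelihoods of -1000 (M0), -950 (M1), -940 (M2), and -500 (saturated), the numerator is 20 and the baseline is 1000, so the extra parameter accounts for 2% of the baseline deviance.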

We illustrate PL by contrasting the model RaMoSS with a
companion model RaMoSSwDT that accounts for the fixation of
DT mutations via two rate parameters, $\alpha$ (the double mutation
rate) and $\beta$ (the triple mutation rate) [26]. RaMoSS combines the
standard M-series model M3 with the covarion-like model CLM3
(cf. [12, 18]). Specifically, RaMoSS mixes (with proportion $p_{M3}$)
one model with two rate-ratio categories $\omega_1 < \omega_2$ that are
constant over the entire tree with a second model (with proportion
$p_{CLM3} = 1 - p_{M3}$) under which sites switch randomly in time
between $\omega'_1 < \omega'_2$ at an average rate of $\delta$ switches
per unit branch length. Fifty alignments were simulated to mimic a
real alignment of 12 concatenated H-strand mitochondrial DNA
sequences (3331 codon sites) from 20 mammalian species as distributed
in the PAML package [73]. The generating model, MutSel-mmtDNA
[26], was based on the mutation-selection framework and produced
alignments with single-nucleotide mutations only. Since
DT mutations are not fixed under MutSel-mmtDNA, the PRD
carried by $(\hat{\alpha}, \hat{\beta})$ in each trial can be equated
to PL (plus noise). The resulting distribution of
PL$(\hat{\alpha}, \hat{\beta})$ is shown as a boxplot in Fig. 4.

Fig. 4 The box plot depicts the distribution of the phenomenological load
(PL) carried by $(\hat{\alpha}, \hat{\beta})$ produced by fitting the RaMoSS vs RaMoSSwDT
contrast to 50 alignments generated under MutSel-mmtDNA: the circles
represent outliers of this distribution. The diamond is the percent reduction in
deviance for the same parameters estimated by fitting RaMoSS vs RaMoSSwDT
to the real mtDNA alignment.

Although DT mutations were not fixed when the data were
generated, shifting balance on a static landscape can produce similar
site patterns as a process that includes rare fixation of DT mutations
(site patterns exhibiting both synonymous and nonsynonymous
substitutions; [26]).⁷ DT and shifting balance are therefore
confounded. And since shifting balance tends to occur at a substantial
proportion (approximately 20%) of sites when an alignment is
generated under MutSel-mmtDNA, DT mutations were falsely
inferred by the LRT in 48 of 50 trials at the 5% level of significance
(assuming LLR ~ $\chi^2_2$ for the two extra parameters $\alpha$ and
$\beta$ in RaMoSSwDT compared to RaMoSS). The
PRD$(\hat{\alpha}, \hat{\beta})$ when RaMoSS vs RaMoSSwDT was fitted to
the real mmtDNA is shown as a diamond in the same plot. Although
$(\hat{\alpha}, \hat{\beta})$ estimated from the real mmtDNA were found
to be highly significant (LLR = 84, p-value << 0.001), the
PRD$(\hat{\alpha}, \hat{\beta})$ was found to be just under the 95th
percentile of PL$(\hat{\alpha}, \hat{\beta})$ (PRD = 0.060% compared to
the 95th percentile of PL = 0.061%). The evidence for DT mutations in
the real data is therefore only marginal, and it is reasonable to
suspect that its PRD$(\hat{\alpha}, \hat{\beta})$, if not entirely the
result of PL, is at least partially caused by PL.

⁷ It has previously been noted that the rapid fixation of compensatory mutations following substitution to an
unstable base pair (e.g., AT→GT→GC) can also produce site patterns that suggest fixation of DT mutations [74,
p. 46].


5 Discussion


CSMs have been subjected to a certain degree of censure, particu- 
larly during Phase I of their development [11, 22, 23, 46, 60-63, 
85]. We maintain that it is not the model in and of itself, or the 
maximum likelihood framework it is based on, that gives rise to 
statistical pathologies, but the relationship between model and 
data. This principle was illustrated by our analysis of the history of 
CSM development, which we divided into two phases. Phase I was 
characterized by the formulation of models to account for differ- 
ences in selection effects across sites and over time that comprise 
the major component of variation in an alignment. Starting with 
M0, such models represent large steps toward the fitted saturated
model in Fig. 2, and also provide a better representation of the true 
generating process. The main criticism of Phase I models was the 
possibility of falsely inferring positive selection in a gene or at an 
individual codon site [62, 63, 85]. However, the most compelling
empirical case of false positives was shown to be the result of 
inappropriate application of a complex model to a sparse alignment 
[63]. Methods for identifying (bootstrap) and dealing with (BEB, 
SBA, and PLRT) low information content were illustrated in Case 
Study A. 

The other big concern that arose during Phase I development 
was the possibility of pathologies associated with model misspecifi- 
cation. The method used to identify such problems was to fit a 
model to alignments generated under a scenario contrived to be 
challenging, as illustrated in Case Study B. There, the omnibus test 
based on Model A of the YN-BSM was shown to result in an excess 
of false positives when fitted to alignments simulated using the 
implausible but difficult “XZ” generating scenario (e.g., with com- 
plete relaxation of selection pressure at all sites on one branch of the 
tree; Table 1). Subsequent modifications to the test reduced the 
false positive rate to acceptable values. Hence, Case Study B under- 
lines the importance of the model-data relationship. However, it is 
not clear whether a model adjusted to suit an unrealistic data- 
generating process is necessarily more reliable when fitted to a real 
alignment. This difficulty highlights the need to find ways, for the
purpose of model testing and adjustment, to generate alignments
that mimic real data as closely as possible. 

Confidence in the CSM approach, combined with the exponential
increase in the volume of genetic data and the growth of
computational power, spurred the formulation of CSMs of
ever-increasing complexity during Phase II. The main issue with these
models, which has not been widely appreciated, is confounding.
Two processes are confounded if they can produce the same or
similar patterns in the data. It is not possible to identify such
processes when viewed through the narrow lens of an alignment
(i.e., site patterns) alone. This was illustrated by Case Study C,
where shifting balance on a static landscape was shown to be
confounded with episodic adaptive evolution [7, 25]. Confounding
can lead to what we call phenomenological load, as demonstrated in
Case Study D. In that analysis, the parameters $(\alpha, \beta)$ were
assigned a specific mechanistic interpretation, the rate at which
double and triple mutations arise. It was shown that $(\alpha, \beta)$
can absorb variations in the data caused by shifting balance; hence,
the MLEs $(\hat{\alpha}, \hat{\beta})$ resulted in a significant
reduction in deviance in 48/50 trials (Fig. 4), and therefore improved
the fit of the model to the data. However, the absence of DT mutations
in the generating process invalidated the intended interpretation of
$(\hat{\alpha}, \hat{\beta})$. This result underlines that a better fit
does not imply a better mechanistic representation of the true
generating process.

It is natural to assume that a better mechanistic representation 
of the true generating process can be achieved by adding para- 
meters to our models to account for more of the processes believed 
to occur. The problem with this assumption is that the metric of 
model improvement under ML (reduction in deviance) is indepen- 
dent of mechanism. A parameter assigned a specific mechanistic 
interpretation is consequently vulnerable to confounding with 
other processes that can produce the same distribution of site 
patterns. As CSMs become more complex, it seems likely that the 
opportunity for confounding will only increase. It would therefore 
be desirable to assess each new model parameter for this possibility 
using something like the method shown in Fig. 4 whenever possi- 
ble. The idea is to generate alignments using MutSel or some other 
plausible generating process in such a way as to mimic the real data 
as closely as possible, but with the new parameter set to its null 
value. To provide a second example, consider the test for changes in 
selection intensity in one clade compared to the remainder of the 
tree known as RELAX [67]. Under this model, it is assumed that
each site evolved under a rate ratio randomly drawn from
$\omega_R = \{\omega_1, \ldots, \omega_k\}$ on a set of prespecified
reference branches, and from a modified set of rate ratios
$\omega_T = \{\omega_1^m, \ldots, \omega_k^m\}$ on test branches,
where m is an exponent. A value 0 < m < 1 moves the rate ratios in
$\omega_T$ closer to one compared to their corresponding values in
$\omega_R$, consistent with relaxation of selection pressure at all
sites on the test branches. Relaxation is indicated when the contrast
of the null hypothesis that m = 1 versus the alternative that m < 1 is
statistically significant. The distribution of PL($\hat{m}$) can be
estimated from alignments generated with m = 1. The PRD($\hat{m}$)
estimated from the real data can then be compared to this distribution
to assess the impact of PL (cf. Fig. 4). This approach is predicated
on the existence of a generating model that could have plausibly
produced the site patterns in the real data. Jones et al. [26] present
a variety of methods for assessing the realism of a simulated
alignment, although further development of such methods is warranted.
Software based on MutSel is currently available for generating data
that mimic large alignments of 100-plus taxa (Pyvolve; [56]). Other
methods have been developed to mimic smaller alignments of certain
types of genes (e.g., MutSel-mmtDNA; [25]). It is only by the use of
these or other realistic simulation methods that the relationship
between a given model and an alignment can be properly understood.
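The assessment described above (cf. Fig. 4) reduces to comparing the PRD estimated from the real alignment with an upper percentile of the PL values obtained from alignments simulated with the parameter at its null value. The Python sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the percentile uses simple linear interpolation, and the list of simulated PL values would have to come from the analyst's own simulations.

```python
import math


def percentile(values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of a list of numbers."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    position = (len(ordered) - 1) * q / 100.0
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] * (1.0 - weight) + ordered[upper] * weight


def exceeds_phenomenological_load(prd_real, pl_simulated, q=95.0):
    """True if the PRD from the real data lies above the q-th percentile of simulated PL.

    `pl_simulated` holds one PRD value per alignment simulated under the null
    (i.e., with the process of interest absent from the generating model).
    """
    return prd_real > percentile(pl_simulated, q)
```

A real-data PRD that does not exceed the chosen percentile, as in Case Study D, suggests that the apparent signal cannot be distinguished from phenomenological load, whatever the LRT reports.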
Abstract


Natural selection is a fundamental force shaping organismal evolution, as it both maintains function and 
enables adaptation and innovation. Viruses, with their typically short and largely coding genomes, experi- 
ence strong and diverse selective forces, sometimes acting on timescales that can be directly measured. 
These selection pressures emerge from an antagonistic interplay between rapidly changing fitness require- 
ments (immune and antiviral responses from hosts, transmission between hosts, or colonization of new host 
species) and functional imperatives (the ability to infect hosts or host cells and replicate within hosts). 
Indeed, computational methods to quantify these evolutionary forces using molecular sequence data were 
initially, dating back to the 1980s, applied to the study of viral pathogens. This preference largely emerged 
because the strong selective forces are easiest to detect in viruses, and, of course, viruses have clear 
biomedical relevance. Recent commoditization of affordable high-throughput sequencing has made it 
possible to generate truly massive genomic data sets, on which powerful and accurate methods can yield 
a very detailed depiction of when, where, and (sometimes) how viral pathogens respond to various selective
forces. 

Here, we present recent statistical developments and state-of-the-art methods to identify and characterize 
these selection pressures from protein-coding sequence alignments and phylogenies. Methods described 
here can reveal critical information about various evolutionary regimes, including whole-gene selection, 
lineage-specific selection, and site-specific selection acting upon viral genomes, while accounting for 
confounding biological processes, such as recombination and variation in mutation rates. 


Key words Virus evolution, Molecular evolution, Recombination, Positive selection, Relaxed selec- 
tion, Phylogenetics, Codon models 


1 Introduction


Natural selection is a powerful evolutionary force that shapes gen- 
omes of all living organisms. A variety of computational approaches 
have been developed to measure the strength and direction of 
selection directly from genomic data. Given an alignment of 
homologous gene sequences, the strength of natural selection act- 
ing on a given gene or genes can be measured in a phylogenetic 
context using codon models [1, 3]. A typical analysis on viral gen-
omes, for example, might be performed for a single gene repre- 
sented by isolates from different individuals (e.g., sequences from 
many HIV-1-infected hosts) or from different hosts (e.g., primate 
lentiviruses). 

In the context of codon models, selection is typically measured
using dN/dS (also referred to as ω, or Ka/Ks), which represents
the ratio of the non-synonymous evolutionary rate (dN) to the
synonymous evolutionary rate (dS). The synonymous evolutionary
rate is used to provide a baseline rate of neutral evolution because
the average selective effect of a synonymous substitution is assumed
to be negligible compared to the effect of a non-synonymous
substitution.¹ The selective regime can be deduced by establishing,
with a degree of statistical confidence, that dN/dS differs from
unity, i.e., the neutral expectation where dN/dS = 1. Diversifying,
balancing, or (sometimes) directional selection yields dN/dS > 1,
whereas purifying selection yields dN/dS < 1. Comparative methods
for selection detection estimate dN/dS, or dS and dN separately,
at sites and/or branches and perform a statistical test to
establish on which side of the neutral expectation the inferences
fall. As with any statistical procedure applied to finite data, each
inference can be a false positive or a false negative, although
methods typically take care to control the rates of both.

While the question “Is this gene under selection?” is an obvious 
one, the nearly universally applicable answer to this question is 
“yes.” That is because a functional gene is (or has been) subject
to some form of selection, e.g., negative selection to maintain 
essential features. On the other extreme is the question that has 
an immediate biological significance: “Is changing a leucine to an 
arginine at position 209 in gene X along a specific branch in the 
phylogeny adaptive?”. Without additional information, such as a 
carefully experimentally measured fitness impact of introducing 
said substitution, current comparative sequence approaches cannot 
answer this question. Indeed, such a scenario presents a sample size 
of one, which cannot be statistically meaningful. 

In this chapter, we present a collection of statistical methods,
each of which is designed to carefully address a biological question
somewhere on the spectrum between the two extremes: sufficiently
specific to be interesting, yet general enough to be answerable
based only on the evolutionary history of homologous sequences.
We will not discuss the technical details of codon substitution
methods here (for details, please see one of the excellent available
reviews, Anisimova and Kosiol [1], Delport et al. [3], Yang [44], or
the primary methods papers, including Goldman and Yang [8],
Kosakovsky Pond and Muse [13], Muse and Gaut [27], and Nielsen
and Yang [29]). Instead, we present each method operationally
(“How and when does one use this method?”), by addressing the
following points:

1. What biological question is the method designed to answer?

2. What are the recommended applications?

3. What is the statistical procedure and statistical test used to
establish significance for this method?

4. How should one interpret positive and negative test results?

5. Rules of thumb for when this method is likely to work well, and
when it is not.

¹ We note that there are a variety of well-documented situations where synonymous substitutions can have strong
effects on fitness [11, 30].


We conclude by discussing how inferences can accommodate 
potentially confounding biological processes, including intragenic 
recombination and mutation rate variation. It is critical to model 
these processes, both in their own right and because ignoring their 
effects could bias selection inference tools and yield misleading 
results. 


The data used throughout the following tutorials and exercises
are available at https://github.com/veg/evogenomics_hyphy. A
“README” file in the top directory of this repository provides
a detailed description of all contents. Importantly, all datasets
used here reside in the datasets directory. Please refer to
http://www.hyphy.org for instructions on downloading and installing
HyPhy to your system. All exercises have been validated using
version 2.3.4. Throughout, we will use the hyphymp executable
(MP = multiprocessor). For all analyses, you will need the following
information:


(a) the full path to all files being analyzed (alignment and tree), 
e.g., /home/user/data/alignment.fna, 


(b) the genetic code (in almost all cases, universal), and 


(c) level of statistical significance; suggestions are given below. 


All methods will produce a final file of results in JSON (Java- 
Script Object Notation) format, a highly extensible format that is 
simple, relatively compact, and both machine- and human- 
readable. JSON output files can be visually and interactively exam- 
ined within our new web application, hyphy-vision, accessible at 
vision.hyphy.org. 

All methods employ the general time reversible (GTR) nucleotide
model for initial branch length optimization and correcting
nucleotide substitution biases, followed by fitting a Muse-Gaut
model (with general time reversible nucleotide biases) to obtain
preliminary dN/dS estimates (see Kosakovsky Pond and Frost [12]
for a detailed model description) for selection inference. Codon
frequencies are estimated using the CF3x4 procedure [15]. In our
view, the historical rationale for using simpler evolutionary models
(e.g., K80, F81, or HKY85), namely, computational cost, to fit
nucleotide data is no longer relevant.

Finally, we recommend different P-value thresholds depending
on the given analysis method. As site-level methods (FEL, SLAC,
and MEME) tend to be conservative on biological data, we recommend
significance as P < 0.1 (or posterior probability > 0.9 for
FUBAR). By contrast, we recommend significance as P < 0.05 for
alignment-wide methods (BUSTED, RELAX, and aBSREL).


3 Methods


3.1 How to Run a Selection Analysis


There is a uniform workflow to run any of the described methods, 
either locally (on one’s own computer and/or a high-performance 
computing environment) in HyPhy or using the Datamonkey 
web-service, available at www.datamonkey.org. The version of 
HyPhy that supports all of the analyses is a command-line program, 
i.e., it must be run from a terminal prompt (similar to most other 
bioinformatics packages) in Linux or Mac OS X. It is also possible 
to run the program in Windows, with an appropriate POSIX emu- 
lation environment (e.g., MinGW) installed. 

To execute a selection analysis locally, the following steps will 
need to be taken. 


1. Prepare your coding sequence alignment. In general, any
duplicate sequences should be removed before analysis. Most 
importantly, it is imperative that the sequence alignment be in 
the correct reading frame, meaning that alignment must be 
performed with codon structure in mind. A common approach 
to ensure this criterion is met is to generate the alignment using 
translated amino-acid data and then back-translate to the orig- 
inal nucleotide sequences. 


2. Prepare a phylogenetic tree from the multiple sequence align- 
ment. Note that certain analyses may require a labeled phylo- 
genetic tree, as indicated within each subsequent tutorial. Keep 
in mind that for most selection analyses, a tree topology is a 
nuisance parameter. Hence, while it is advisable to use good 
practices when inferring trees, minor errors in tree inference 
tend to have minor effects on gene- and site-level inference. A 
notable exception occurs when lineage-specific selection is 
investigated; in this case, ensuring high-quality tree topologies 
is important. 
3. An essential and strongly recommended step before analyzing
data for selection is to screen sequences for recombination. If 
recombinant sequences are naively analyzed without an appro- 
priate phylogenetic correction, inference results are likely to be 
biased (Posada et al. [33]) (see the section on Screening 
sequences for recombination later in this chapter). 


4. Prepare your data (alignment and phylogeny) for input to 
HyPhy. There are three ways to provide a dataset for HyPhy 
analysis, each of which will trigger a different analysis prompt at 
runtime: 


• Two separate files containing the alignment and phylogeny,
respectively. In this circumstance, HyPhy issues two successive
prompts: the first for the file containing the alignment,
and the second for the file containing the tree.


• A single file containing an alignment in one of the formats
supported by HyPhy (FASTA, MEGA, and PHYLIP), with a
Newick-formatted phylogeny included at the bottom of this
file. In this circumstance, HyPhy issues two successive
prompts: the first for the file containing the alignment, and
the second asking whether to accept the tree found in the file
(provide the affirmative response, e.g., “y,” to accept it).


• A NEXUS file containing both the alignment and phylogeny.
In this circumstance, HyPhy automatically accepts the
provided phylogeny and therefore only issues a single
prompt for the file containing the alignment. This is also
the format that can be used to specify partitioned data,
which is necessary to account for recombination.


5. Execute the appropriate method in HyPhy, selecting options 
suitable for the specific analysis. 
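Step 1 above (removing duplicate sequences and keeping the alignment in frame by back-translating an amino-acid alignment) can be sketched in a few lines of Python. This is a minimal illustration, not part of HyPhy: the function names are hypothetical, each coding sequence is assumed to contain exactly one codon per aligned residue (no terminal stop codon), and gaps are assumed to be written as "-".

```python
def back_translate(aligned_protein, cds, gap="-"):
    """Thread an unaligned coding sequence onto its aligned protein sequence.

    Every residue consumes the next codon; every protein gap becomes a
    three-nucleotide gap, so the result stays in the correct reading frame.
    """
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) != 3 * residues:
        raise ValueError("coding sequence must have exactly 3 nt per residue")
    codons = iter(cds[i:i + 3] for i in range(0, len(cds), 3))
    return "".join(gap * 3 if aa == gap else next(codons) for aa in aligned_protein)


def drop_duplicates(alignment):
    """Keep the first record of each distinct sequence (alignment: name -> sequence)."""
    seen = set()
    unique = {}
    for name, sequence in alignment.items():
        if sequence not in seen:
            seen.add(sequence)
            unique[name] = sequence
    return unique


if __name__ == "__main__":
    # Toy data, for illustration only.
    print(back_translate("M-K", "ATGAAA"))
    print(drop_duplicates({"a": "ATGAAA", "b": "ATGAAA", "c": "ATGAAG"}))
```

Checking that the length of every back-translated sequence is a multiple of three and identical across records is a cheap way to catch frame errors before running an analysis.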


Each method will provide live on-the-screen progress updates
and, when finished, a text summary of the analysis. The output is
generated in Markdown,² which can either be read directly as text
or formatted using one of many Markdown viewers.

² With the exception of GARD.

When an analysis is finished, HyPhy will write a JSON file with
numerous details about the analysis to disk. By convention, this file
will be placed in the same directory as the input alignment file, with
the added <method>.json extension, e.g., flu_ha.nex.BUSTED.json
for an input alignment named flu_ha.nex analyzed by the method
BUSTED. All results contained in this JSON file can be explored
visually within a web browser using a web application from the
hyphy-vision suite of tools, accessible at vision.hyphy.org. Since
JSON files can be easily accessed by scripting and data-analysis
languages, these are also well-suited for incorporation into
pipelines.³
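As a sketch of such pipeline use, the Python functions below build the result file name from the naming convention described above and load it with the standard json module. The function names are hypothetical, and nothing is assumed here about the fields inside the file; inspecting the top-level keys of a real result is the safest way to learn its layout for a given HyPhy version.

```python
import json


def result_path(alignment_path, method):
    """Apply the naming convention <alignment file>.<METHOD>.json."""
    return f"{alignment_path}.{method}.json"


def load_results(alignment_path, method):
    """Read the JSON results written next to the input alignment."""
    with open(result_path(alignment_path, method)) as handle:
        return json.load(handle)


def top_level_keys(results):
    """Sorted top-level keys, a quick way to explore an unfamiliar result file."""
    return sorted(results)
```

For example, result_path("flu_ha.nex", "BUSTED") gives flu_ha.nex.BUSTED.json, matching the example in the text.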

When run through www.datamonkey.org, this entire workflow 
is automated: one simply uploads an alignment, selects options for 
the analysis, and waits for the job to finish. Once the job has 
completed, the results will be displayed in an interactive application 
within the web browser. Note that Datamonkey will automatically 
remove duplicate sequences before executing any analysis. 


3.2 BUSTED 


What Biological Question Is the Method Designed to 
Answer?: 

Is there evidence that some sites in the alignment have been 
subject to positive diversifying selection, either pervasive 
(throughout the evolutionary tree) or episodic (only on some 
lineages)? In other words, BUSTED asks whether a given gene 
has been subject to positive, diversifying selection at any site, at 
any time [26]. If a priori information about lineages of interest 
is available (e.g., due to migration, change in the environment,
etc.), then BUSTED can be restricted to test for selection only on 
a subset of tree lineages, potentially boosting power. 


Recommended Applications 


1. Annotating a collection of alignments with a binary attribute: 
Has this alignment been subject to positive diversifying selec- 
tion (yes/no)? [34]. 

2. Testing small- or low-divergence alignments (i.e., fewer than ~10
sequences) for evidence of positive diversifying selection,
where neither branch- nor site-level methods have sufficient
power to detect weak, but present, signal.


Statistical Test Procedure: 

Each (branch, site) pair evolves with $\omega_1 < \omega_2 < 1$, or
$\omega_3 > 1$, with the ratio chosen independently of other
(branch, site) pairs with probability $p_1$, $p_2$, $p_3$ (normalized
to sum to 1). The three-rate ω distribution is estimated jointly from
the entire alignment, i.e., rates are shared by all (branch, site)
combinations. Therefore, BUSTED is technically a “branch-site”
model [16], although it is not intended to detect individual sites
which drive signal of selection.


³ Note that the method GARD does not provide Markdown output or a JSON file, and its output is in a different format.
This may be updated in a future HyPhy release.

The test for episodic diversifying selection is performed by
comparing the full model versus the nested null model, where
$\omega_3$ is constrained to 1. Statistical significance is obtained
by the likelihood ratio test, assuming the $\chi^2_2$ asymptotic
distribution of the likelihood ratio statistic under the null model.

When only some of the branches are chosen for testing, and the
remainder are designated as the background, two independent
three-rate ω distributions are fitted: one for the test branches, and
one for the background branches. Testing for selection is carried
out by constraining the distribution on the test branches as
described above.


Example Analysis: 

To begin, we will perform a BUSTED analysis using a dataset of 
primate-specific KSR2, kinase suppressor of RAS2, genes from 
Enard et al. [5]. This gene has been implicated as a so-called 
‘virus-interacting protein,’ and previous work has suggested it 
has experienced adaptation in mammalian lineages due to 
selective pressures exerted by viruses [5]. We will test all lineages
for positive selection (rather than specifying a subset of “test”
branches), thereby asking the question: “Has KSR2 been subject
to diversifying selection at some time during evolution in
primates?”


To run BUSTED, open a terminal session and enter HYPHYMP 
from the command line to launch the HyPhy analysis menu. Enter 
1 (Selection Analyses) and then 5 to reach the BUSTED analysis 
menu, and supply values for the following prompts: 


1. Choose genetic code. This option tells HyPhy which transla- 
tion table to use for codon-level analyses. Enter 1 to use the 
Universal genetic code. 


2. Select a coding sequence alignment file. Provide the full path 
to the dataset of interest: /path/to/data/ksr2.fna. 


3. A tree was found in the data file... Would you like to use it 
(y/n)? Enter “y” to use the tree. 


4. Choose the set of branches to test for selection. Enter 1 to 
test all branches for selection. 


BUSTED will now run to completion, printing status indicators
to screen while it runs. For an example of how this output will
look when rendered into HTML (or similarly, PDF), see this link:
http://bit.ly/2vsRZrh.
Listing 1 Partial BUSTED screen output:


### Branches to test for selection in the BUSTED analysis
* Selected 15 branches to test in the BUSTED analysis: `HUM, PAN, Node6, GOR, Node5, PON, Node4, GIB, Node3, MAC, BAB, Node12, Node2, MAR, BUS`

### Obtaining branch lengths and nucleotide substitution biases under the nucleotide GTR model
* Log(L) = -5768.01, AIC-c = 11582.06 (23 estimated parameters)

### Obtaining the global omega estimate based on relative GTR branch lengths and nucleotide substitution biases
* Log(L) = -5342.48, AIC-c = 10745.17 (30 estimated parameters)
* non-synonymous/synonymous rate ratio for *test* = 0.0342

### Improving branch lengths, nucleotide substitution biases, and global dN/dS ratios under a full codon model
* Log(L) = -5333.46, AIC-c = 10727.13 (30 estimated parameters)
* non-synonymous/synonymous rate ratio for *test* = 0.0307

### Performing the full (dN/dS > 1 allowed) branch-site model fit
* Log(L) = -5319.67, AIC-c = 10707.62 (34 estimated parameters)
* For *test* branches, the following rate distribution for branch-site combinations was inferred

| Selection mode         | dN/dS   | Proportion, % | Notes |
|------------------------|---------|---------------|-------|
| Negative selection     | 0.024   | 99.151        |       |
| Negative selection     | 0.085   | 0.812         |       |
| Diversifying selection | 118.143 | 0.037         |       |

### Performing the constrained (dN/dS > 1 not allowed) model fit
* Log(L) = -5326.18, AIC-c = 10718.63 (33 estimated parameters)
* For *test* branches under the null (no dN/dS > 1 model), the following rate distribution for branch-site combinations was inferred

| Selection mode     | dN/dS | Proportion, % | Notes                |
|--------------------|-------|---------------|----------------------|
| Negative selection | 0.000 | 10.598        |                      |
| Negative selection | 0.000 | 86.086        | Collapsed rate class |
| Neutral evolution  | 1.000 | 3.316         |                      |

## Branch-site unrestricted statistical test of episodic diversification [BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p = 0.0015**.
Interpreting Results:

The results printed to the terminal indicate a highly significant
result (P = 0.0015) in the test for whole-gene selection. Analysis
with BUSTED therefore provides robust evidence that KSR2
experienced episodic positive selection in the primates. Because
we performed the original BUSTED analysis on the entire tree
(i.e., without a specified set of test branches), we do not know
from this result along which lineages KSR2 was subject to positive
selection. We can conclude only that a non-zero proportion of
sites on some lineage(s) in the primate tree experienced
diversifying selection pressure.
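The reported P value can be checked by hand from the two log-likelihoods in Listing 1. The Python sketch below is illustrative only; it assumes that the likelihood ratio statistic is referred to a chi-square distribution with two degrees of freedom (an assumption on our part, chosen because it reproduces the printed value), for which the upper tail probability has the closed form exp(-x/2).

```python
import math


def lrt_statistic(lnl_null, lnl_alternative):
    """Likelihood ratio statistic: twice the gain in log-likelihood."""
    return 2.0 * (lnl_alternative - lnl_null)


def chi2_sf_2df(x):
    """Upper tail probability of a chi-square variable with 2 d.f.: exp(-x / 2)."""
    return math.exp(-x / 2.0) if x > 0 else 1.0


# Log-likelihoods copied from Listing 1.
lnl_full = -5319.67   # full model, dN/dS > 1 allowed
lnl_null = -5326.18   # constrained model, dN/dS > 1 not allowed

lrt = lrt_statistic(lnl_null, lnl_full)
p_value = chi2_sf_2df(lrt)
print(f"LRT = {lrt:.2f}, p = {p_value:.4f}")  # prints: LRT = 13.02, p = 0.0015
```

The statistic depends only on the difference between the two fits, so the check applies equally to any other pair of nested models reported in the same format, provided the appropriate reference distribution is used.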


The output additionally provided information about the specific
BUSTED model fits to the test data, including the inferred ω
distributions and corresponding weights. The BUSTED alternative
model (shown under the output header Performing the full
(dN/dS > 1 allowed) branch-site model fit) found that a very
small proportion (only ~0.037%) of sites evolved under a very large
ω of over 100 (118.143). Importantly, neither of these estimates is
precise because they were derived from a small subset of the data. As
such, all that the BUSTED test establishes is that the proportion of
sites along test lineages (here, the entire phylogeny) with ω > 1 is
non-zero. For example, if BUSTED had inferred a rate category of
ω = 10 on a different gene, it would not be correct to claim that
this gene evolves under weaker selection than does KSR2. A formal
statistical test would have to be carried out to establish such a claim.

Conversely, had the result not been statistically significant, we
would not be able to reject the null hypothesis that no positive
selection had occurred in KSR2. Importantly, however, a negative
finding would not unequivocally rule out the presence of positive
selection. This outcome could be due to a lack of statistical power,
wherein the provided data did not contain a sufficiently strong
signal of selection.


BUSTED’s fixed a priori assumption of model complexity
(a three-rate ω distribution) may lead to over-parameterized
(or under-parameterized) models. For example, in the constrained
model for KSR2, two of the three rate classes have the same value of
ω (0.0), implying that one of them is unnecessary. HyPhy will report
this to the screen with the diagnostic message "Collapsed rate
class", but no corrective action needs to be taken. These messages
simply point to low-complexity data.


We will additionally take this opportunity to showcase the
visual power of our accompanying web application, HyPhy-Vision.
Figure 1 displays the rendering of the output file
ksr2.fna.BUSTED.json as it appears in HyPhy-Vision. On this site,
users can interactively view and explore inference results, view
figures and charts, and perform other tasks.

Fig. 1 Example analysis visualization in HyPhy-Vision of BUSTED results. (a) The summary section provides a
brief overview of the analysis performed, including information about the inputted data (which can be
downloaded via the linked file name) and primary results from the hypothesis test performed. (b) The
model statistics section provides information about models fitted to the data. In BUSTED, this section
additionally includes an interactive display of site evidence ratios, which can be interpreted as a descriptive
measure for which sites may have contributed to the selection signal. (c) The tree section displays the
phylogeny as fitted under all inferred models and data partitions, if specified. Tree views can be toggled under
the Options drop-down menu. (d) Graphical views of each model’s inferred ω distribution can be viewed when
clicking on a given row’s plot icon in the Model fits table seen in (b)


Rules of Thumb for BUSTED Use 


1. Best applied to small- or medium-sized datasets (e.g., up to 
100 sequences). Larger datasets will take longer to run and may 
not be well described by a fixed complexity model. 


2. If one suspects that only a small subset of lineages is subject to 
selection, e.g., because the phenotype, environment, or fitness 
changed along those branches, designating those a priori as the 
test set will significantly boost power. 


3. In simulation studies, BUSTED performs best when a sufficient
proportion (5-10%) of branch-site combinations is subject to
positive diversifying selection, and the effect size (ω value) is
reasonably large (e.g., > 3).


3.3 RELAX

What Biological Question Is the Method Designed to
Answer?:

Is there evidence that the strength of selection has been relaxed
(or conversely intensified) on a specified group of lineages (Test)
relative to a set of reference lineages (Reference)? We note that
the RELAX framework can perform this specific hypothesis test
as well as fit a suite of descriptive models which address, for
example, overall rate differences between test and reference
branches or lineage-specific inferences of selection relaxation.
We focus our attention here on RELAX’s hypothesis testing
abilities. More information about descriptive analyses is available
on hyphy.org as well as in RELAX’s primary publication
[43]. Importantly, RELAX is not designed to detect diversifying
selection specifically.


Recommended Applications 


1. Testing for a systematic shift (relaxation/intensification) in
the distribution of selection pressure associated with major
biological transitions such as host switching in viruses [6]
or lifestyle evolution in bacteria (i.e., transition from a
free-living to an endosymbiotic lifestyle [43]).


2. Comparing selective regimes between two subsets of
branches in the tree, e.g., to investigate selective differences
among transmission routes in HIV-1 [42].


Statistical Test Procedure: 

Given a tree with at least two sets of branches, one of which is
designated as Test, and the other as Reference, the core version
of RELAX compares two nested models, which follow the same
general framework as BUSTED. Each (branch, site) combination
is drawn independently from a 3-rate ω distribution. The
evolutionary rates for Test branches are functions of those for
Reference branches. Specifically,
$\omega_{\mathrm{Test}} = \omega_{\mathrm{Reference}}^{K}$, where K is
the relaxation or intensification parameter. The alternative
model infers K from the data, and the null model sets K = 1.
Statistical significance is obtained by the likelihood ratio test,
assuming the $\chi^2$ asymptotic distribution under the null model. A
significant result of K > 1 indicates that selection strength has
been intensified along the test branches, and a significant result
of K < 1 indicates that selection strength has been relaxed along
the test branches. In other words, for K < 1 the Test ω values
shrink toward neutrality (ω = 1) relative to Reference, and for
K > 1 they move away from neutrality.

If some branches in the tree belong to neither the Test nor the
Reference set, they are allocated to a group with its own
(Unclassified) distribution of ω, which is uncoupled from the testing
procedure.
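
The exponent relationship between Test and Reference rates can be made concrete with a few lines of Python. The sketch below is only an illustration of the formula, not of how HyPhy fits the model: the helper name `relax_transform` is hypothetical, and the input values (three Reference ω classes and K = 0.73) are copied from the RELAX screen output shown later in Listing 2.

```python
def relax_transform(reference_omegas, k):
    """Raise each Reference omega to the power k (hypothetical helper)."""
    return [omega ** k for omega in reference_omegas]


# Reference rate classes and K as reported in Listing 2.
reference = [0.009, 0.035, 1.591]
k = 0.73

test = relax_transform(reference, k)
for ref, tst in zip(reference, test):
    # For K < 1, every class moves toward omega = 1 (neutrality).
    print(f"reference omega = {ref:.3f} -> test omega = {tst:.3f}")
```

Up to rounding of the printed inputs, the transformed values should land close to the Test rate classes reported in the same listing (0.031, 0.086, and 1.406). Classes below 1 increase and the class above 1 decreases, which is what "relaxation" means in this framework.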


Example Analysis: 

We will perform a RELAX analysis using a dataset of Influenza
A PB2 subunit sequences from Tamuri et al. [41]. The
PB2 subunit, which is part of influenza’s RNA polymerase
complex, has emerged as a critical determinant of influenza
infectivity and, as a consequence, host range [9, 18]. The dataset
we examine here contains sequences from both avian host and
human host strains.* Previous studies have shown that this host
switch is correlated with significant shifts in selection pressures
and preferred amino acids at key sites in PB2 [36, 40, 41]. We
now re-analyze this dataset using RELAX to ask a different but
related question: “Was the shift from avian to human hosts
associated with a relaxation of selection pressures in Influenza
A PB2?”

* The original dataset in Tamuri et al. [41] contained 401 sequences. For the purposes of this chapter, we analyze a
subset of this alignment with only 35 sequences (20 from avian and 15 from human hosts), thereby achieving a
tractable runtime on a personal machine.


RELAX requires an a priori specification of test and reference
lineages, although not all lineages in a tree need to be classified. As
such, you must label your test (and reference, if desired) branches in
the input phylogeny. We provide an online widget to assist with tree
labeling at http://phylotree.hyphy.org. The dataset we have
provided for this analysis already has a labeled phylogeny, with the
human host lineages labeled as “test.”

To run RELAX, open a terminal session and enter HYPHYMP
from the command line to launch the HyPhy analysis menu. Enter
1 (Selection Analyses) and then 7 to reach the RELAX analysis
menu, and supply values for the following prompts:


1. Choose genetic code. Enter 1 to use the Universal
genetic code.


2. Select a coding sequence alignment file. Provide the full path
to the dataset of interest: /path/to/data/pb2.fna.


3. A tree was found in the data file... Would you like to use it
(y/n)? Enter “y” to use the tree.


4. Choose the set of branches to test for selection. This option
asks you to specify the label inside your tree used to specify the
test lineages. You can either select all unlabeled branches, or
HyPhy will show all labels it found in the tree you provided.
Enter 1 to select the branches labeled as “test” as the test set in
RELAX analysis. Note that when multiple labels are present in
your tree, HyPhy will issue an additional prompt to choose the
set of Reference branches, in the event that some branches
should remain Unclassified.


5. Analysis type. This option asks you to specify the scope of
RELAX analysis. Selecting “Minimal” will run the RELAX
hypothesis test, and selecting “All” will run hypothesis testing
and fit two additional descriptive models, described earlier.
Here, we will perform only hypothesis testing to determine
whether the data show evidence for a relaxation or intensification
of selection intensity between the test and reference
lineages. Enter the option 2 to run the “Minimal” analysis.


RELAX will now run to completion, printing status indicators 
to screen while it runs. 


Listing 2 Partial RELAX screen output: 
### Obtaining branch lengths and nucleotide substitution biases under the nucleotide GTR model
* Log(L) = -16755.26, AIC-c = 33660.66 (75 estimated parameters)

### Obtaining the global omega estimate based on relative GTR branch lengths and nucleotide substitution biases
* Log(L) = -14410.97, AIC-c = 28988.46 (83 estimated parameters)
* non-synonymous/synonymous rate ratio for *Reference* = 0.0401
* non-synonymous/synonymous rate ratio for *Test* = 0.0604

### Improving branch lengths, nucleotide substitution biases, and global dN/dS ratios under a full codon model
* Log(L) = -14354.67, AIC-c = 28875.86 (83 estimated parameters)
* non-synonymous/synonymous rate ratio for *Reference* = 0.0358
* non-synonymous/synonymous rate ratio for *Test* = 0.0609

### Fitting the alternative model to test K != 1
* Log(L) = -14337.22, AIC-c = 28849.02 (87 estimated parameters)
* Relaxation/intensification parameter (K) = 0.73
* The following rate distribution was inferred for **test** branches

| Selection mode         | dN/dS | Proportion, % | Notes |
|------------------------|-------|---------------|-------|
| Negative selection     | 0.031 | 94.752        |       |
| Negative selection     | 0.086 | 2.951         |       |
| Diversifying selection | 1.406 | 2.297         |       |

* The following rate distribution was inferred for **reference** branches

| Selection mode         | dN/dS | Proportion, % | Notes |
|------------------------|-------|---------------|-------|
| Negative selection     | 0.009 | 94.752        |       |
| Negative selection     | 0.035 | 2.951         |       |
| Diversifying selection | 1.591 | 2.297         |       |

### Fitting the null (K := 1) model
* Log(L) = -14342.33, AIC-c = 28857.22 (86 estimated parameters)
* The following rate distribution for test/reference branches was inferred

| Selection mode         | dN/dS | Proportion, % | Notes |
|------------------------|-------|---------------|-------|
| Negative selection     | 0.010 | 94.149        |       |
| Negative selection     | 0.021 | 3.391         |       |
| Diversifying selection | 1.735 | 2.460         |       |

## Test for relaxation (or intensification) of selection [RELAX]
Likelihood ratio test **p = 0.0014**.

> Evidence for *relaxation of selection* among **test** branches _relative_ to the **reference** branches at P<=0.05


Interpreting Results: 

On this data, RELAX has inferred a relaxation parameter
K = 0.73 with a highly significant P = 0.0014. Therefore,
there is evidence to reject the null hypothesis that selection
pressure has not been shifted in the test (here, human host) lineages.
We instead have strong evidence that selection has been relaxed
(because the inferred K < 1) in the human host lineages. In
other words, selection in the test branches has generally moved
towards neutrality (ω = 1) compared to the reference branches.
This finding is consistent with the evolutionary changes that
typically occur during a virus host-switching event, wherein
selection stringency will be reduced to facilitate viral
adaptation.
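
As a sanity check on the reported significance, the likelihood ratio test can be recomputed by hand from the two log-likelihoods printed in Listing 2. The sketch below assumes one degree of freedom, since the alternative model has exactly one more estimated parameter (K) than the null model (87 versus 86). The helper name `chi2_sf_df1` is ours, and HyPhy's internal computation may differ in detail.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    # For df = 1, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


log_l_alternative = -14337.22  # K estimated from the data (87 parameters)
log_l_null = -14342.33         # K fixed at 1 (86 parameters)

# Likelihood ratio statistic: twice the log-likelihood difference.
lrt = 2.0 * (log_l_alternative - log_l_null)
p_value = chi2_sf_df1(lrt)
print(f"LRT = {lrt:.2f}, p = {p_value:.4f}")
```

The statistic is 2 × 5.11 = 10.22, and the resulting P-value should agree with the reported P = 0.0014 to the precision shown.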


Keep in mind that RELAX defines relaxation (or intensification)
in a fairly restrictive fashion. In other words, all selective
regimes (i.e., all ω rates), both negative and positive, must weaken
or strengthen. Therefore, certain relaxation scenarios, for example,
when only positive selection is relaxed but negative selection is
maintained, may result in a non-significant RELAX test even
though selection has changed.


Rules of Thumb for RELAX Use


1. Always provide a labeled phylogeny indicating which branches
to include in the “test” lineages. You can additionally label
“reference” lineages if you wish to keep some branches as
unclassified. It is convenient to use the phylotree.js online
widget at http://phylotree.hyphy.org/ to label branches
before analysis.


3.4 aBSREL


It is often of interest to determine whether a specific lineage or
lineages have been subject to selection. Such analyses have
historically been performed using the so-called branch or branch-site
class of models, which allow evolutionary rates to vary across
branches or across sites and branches [16, 45, 46]. Early versions
of branch-site models allowed users to compare selection pressure
on a pre-selected set of “foreground” branches to a
pre-selected set of “background” branches, on which positive
selection was disallowed [45, 46]. (Note that this approach is similar to
how BUSTED performs gene-wide selection inference [26].) Later
efforts demonstrated that disallowing positive selection on
background branches could lead to highly elevated false positive rates
and advocated a strategy wherein any branch, regardless of data
partition, could evolve at any rate [16]. This strategy has been
described as the BS-REL model in HyPhy [16]. However, in
BS-REL, each branch was constrained to have three rate categories,
an assumption with little justification.

Since then, we have developed a greatly improved branch-site
model called aBSREL (“adaptive branch-site random effects
likelihood”). Rather than assuming that each branch should be fit with
three rate classes, aBSREL infers, using the small-sample Akaike
Information Criterion correction (AICc), the optimal number of rate
categories per branch. In this manner, computational complexity
and the number of parameters are greatly reduced, leading to a
tractable runtime for larger datasets that could not otherwise be
studied with earlier branch-site models.


What Biological Question Is the Method Designed to 
Answer?: 

Like classical branch-site models, aBSREL asks whether some 
proportion of sites is subject to positive selection along specific 
branches or lineages of a phylogeny. 


Recommended Applications 


1. Exploratory testing for evidence of lineage-specific positive
diversifying selection in small- to medium-sized alignments
(up to 100 sequences).

2. Targeted testing of branches selected a priori for positive
diversifying selection. This includes alignments with prohibitive
runtimes under older branch-site models (up to ~1000
sequences) [37].


Statistical Test Procedure: 

aBSREL uses the information-theoretic criterion AICc to
automatically determine the complexity of the evolutionary process at
every branch [37]. As a heuristic optimization, aBSREL will
always examine branches in order from longest to shortest,
because longer branches tend to be the ones requiring more
complex models. In this adaptive model, one rate class is allowed
to take values of ω > 1, whereas every other inferred
rate class is constrained to ω ≤ 1. In the null model, all ω
categories are constrained to ω ≤ 1. For any branch inferred
to have sufficient rate variation (i.e., more than one rate
category) where one rate category is described by ω > 1, aBSREL
will proceed to fit a null model to this branch. In other words, if
the maximum inferred ω ≤ 1 on a branch, the null model will
have exactly the same fit as the alternative model, and the
resulting P-value is 1. The test for lineage-specific diversifying
selection is performed by comparing the full model versus the nested
null model, and statistical significance is obtained by the
likelihood ratio test. Significance is evaluated using a mixture of
50% $\chi^2_0$, 20% $\chi^2_1$, and 30% $\chi^2_2$ distributions
(proportions determined via simulations; Smith et al. [37]). Finally,
aBSREL will correct all P-values obtained from individual tests for
multiple comparisons using the Bonferroni-Holm procedure to control
family-wise false-positive rates (i.e., the probability of generating
one or more false positives, when all null hypotheses are correct).


One can either select a specific set of branches in order to test a 
specific a priori hypothesis or one can perform an exploratory 
analysis across the entire phylogeny by testing all branches for 
selection. The former approach may have substantially more 
power to detect selection, especially if only a few branches in a 
large tree are chosen, due to the decreased volume of multiple 
testing. However, the approach does carry the risk of failing to 
identify branches subject to positive selection that have not been 
included in the test set.

Example Analysis:

Here, we will demonstrate aBSREL use and interpretation
using a dataset of HIV-1 env sequences collected from an
epidemiologically linked donor-recipient transmission pair [7]. This
dataset can be found in the provided file hiv1_transmission.fna.


To run aBSREL, open a terminal session and enter HYPHYMP 
from the command line to launch the HyPhy analysis menu. Enter 
1 (Selection Analyses) and then 6 to reach the aBSREL analysis 
menu, and supply values for the following prompts: 


1. Choose genetic code. This option tells HyPhy which transla- 
tion table to use for codon-level analyses. Enter 1 to use the 
Universal genetic code. 


2. Select a coding sequence alignment file. Provide the full path
to the dataset of interest: /path/to/hiv1_transmission.fna.


3. A tree was found in the data file... Would you like to use it
(y/n)? Enter “y” to use the included tree.


4. Choose the set of branches to test for selection. You can now 
select on which branches aBSREL should conduct a formal 
hypothesis test for positive selection. Enter 1 to test all 
branches for selection. 


aBSREL will now run to completion, printing status indicators 
to screen while it runs (some output abbreviated). 


Listing 3 Partial aBSREL screen output: 


### Obtaining branch lengths and nucleotide substitution biases under the nucleotide GTR model
* Log(L) = -5524.50, AIC-c = 11153.08 (52 estimated parameters)

### Fitting the baseline model with a single dN/dS class per branch, and no site-to-site variation.
* Log(L) = -5402.40, AIC-c = 11009.72 (102 estimated parameters)
* Branch-level non-synonymous/synonymous rate ratio distribution has median 0.66, and 95% of the weight in 0.00--5.41

### Determining the optimal number of rate classes per branch using a step up procedure

| Branch    | Length | Rates | Max. dN/dS       | Log(L)   | AIC-c    | Best AIC-c so far |
|-----------|--------|-------|------------------|----------|----------|-------------------|
| 0564_22   | 0.01   | 2     | 1.96 (52.27%)    | -5402.41 | 11013.78 | 11009.72          |
| 0564_7    | 0.01   | 2     | 0.74 ( 5.19%)    | -5402.40 | 11013.76 | 11009.72          |
| Separator | 0.01   | 2     | 197.32 ( 3.95%)  | -5397.53 | 11004.02 | 11004.02          |
| Separator | 0.01   | 3     | 180.22 ( 4.08%)  | -5397.53 | 11008.06 | 11004.02          |
| 0564_4    | 0.01   | 2     | 29.79 ( 2.15%)   | -5394.37 | 11001.74 | 11001.74          |
| 0564_4    | 0.01   | 3     | 29.78 ( 2.15%)   | -5394.37 | 11005.78 | 11001.74          |
| 0564_3    | 0.01   | 2     | 126.86 ( 3.14%)  | -5388.59 | 10994.22 | 10994.22          |
| 0564_3    | 0.01   | 3     | 135.96 ( 3.05%)  | -5388.59 | 10998.25 | 10994.22          |
| 0564_9    | 0.01   | 2     | 10.01 ( 8.61%)   | -5388.37 | 10997.82 | 10994.22          |
| Node53    | 0.00   | 2     | 1.00 (100.00%)   | -5371.63 | 10976.46 | 10971.76          |
| 0557_6    | 0.00   | 2     | 27.66 (100.00%)  | -5371.32 | 10975.83 | 10971.76          |
| 0557_21   | 0.00   | 2     | 0.25 ( 1.96%)    | -5371.30 | 10975.80 | 10971.76          |
| 0557_7    | 0.00   | 2     | 0.25 ( 1.96%)    | -5371.30 | 10975.80 | 10971.76          |

## Rate class analyses summary
* 38 branches with **1** rate classes
* 6 branches with **2** rate classes

## Improving parameter estimates of the adaptive rate class model
* Log(L) = -5370.66, AIC-c = 10970.49 (114 estimated parameters)

## Testing selected branches for selection

| Branch    | Rates | Max. dN/dS       | Test LRT     | Uncorrected p-value |
|-----------|-------|------------------|--------------|---------------------|
| 0564_22   | 1     | 1.22 (100.00%)   | [illegible]  | 0.43015             |
| 0564_7    | 1     | 0.61 (100.00%)   | 0.00         | 1.00000             |
| Separator | 2     | 197.72 ( 3.95%)  | 14.13        | 0.00029             |
| 0564_4    | 2     | 28.89 ( 2.15%)   | 4.81         | 0.03281             |
| 0564_3    | 2     | 127.66 ( 3.14%)  | 14.06        | 0.00030             |
| 0564_9    | 1     | 0.72 (100.00%)   | 0.00         | 1.00000             |
| 0564_1    | 1     | 1.07 (100.00%)   | 0.01         | 0.48208             |
| 0557_21   | 1     | 1.00 (100.00%)   | 0.00         | 1.00000             |
| 0557_7    | 1     | 1.00 (100.00%)   | 0.00         | 1.00000             |

## Adaptive branch site random effects likelihood test
Likelihood ratio test for episodic diversifying positive selection at Holm-Bonferroni corrected _p = 0.0500_ found **3** branches under selection among **44** tested.

* Node35, p-value = 0.00018
* Separator, p-value = 0.01251
* 0564_3, p-value = 0.01266

Interpreting Results:

The first printed markdown table ("Determining the optimal
number of rate classes per branch using a step up procedure")
summarizes the model selection process. For example, when two ω
rates were assigned to branch Separator, this improved the AICc
score of the fit (compared to the single-rate model) from
11,009.72 to 11,004.02. However, allocating three ω rates to
the same branch worsens the score to 11,008.06. Therefore, the
aBSREL model will use two ω rates at this branch.
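
The decision rule just described can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not HyPhy's implementation: the function name `choose_rate_classes` is hypothetical, the AICc values are copied from Listing 3, and the "best AICc so far" is passed in by hand rather than carried across branches as the real procedure does.

```python
def choose_rate_classes(baseline_aicc, aicc_by_rate_count):
    """Return (rate classes kept, AICc) from a greedy step-up search.

    aicc_by_rate_count maps a candidate number of rate classes (2, 3, ...)
    to the AICc obtained when that many classes are fitted to the branch.
    """
    best_count, best_aicc = 1, baseline_aicc
    for count in sorted(aicc_by_rate_count):
        candidate = aicc_by_rate_count[count]
        if candidate >= best_aicc:
            break  # one more class no longer improves AICc, so stop here
        best_count, best_aicc = count, candidate
    return best_count, best_aicc


# Branch "Separator": two classes help, a third does not.
separator = choose_rate_classes(11009.72, {2: 11004.02, 3: 11008.06})
# Branch "0564_22": a second class already worsens AICc.
branch_0564_22 = choose_rate_classes(11009.72, {2: 11013.78})
print(separator, branch_0564_22)
```

Because lower AICc is better, Separator keeps two rate classes while 0564_22 stays with one, matching the "Rates" column of the subsequent test table.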


The second printed markdown table ("Testing selected
branches for selection") shows the results of tests for episodic
selection on individual branches. At branch 0564_4, for example,
the tested model includes two ω rates, with the positive selection
class taking on value 28.89 (2.15% proportion of the mixture).
Constraining this rate to range between 0 and 1 yields the
likelihood ratio test statistic of 4.81, which maps to a P-value (before
multiple test correction) of 0.03281.
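
The mapping from a test statistic to an uncorrected P-value can be reproduced approximately with the following sketch. It assumes that significance is evaluated against a mixture with weights of 50%, 20%, and 30% on chi-squared distributions with 0, 1, and 2 degrees of freedom; these weights are our reading of the test procedure, and they do reproduce the values printed in Listing 3. The function names are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable for df in {0, 1, 2}."""
    if df == 0:
        return 1.0 if x <= 0 else 0.0  # point mass at zero
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 0, 1, 2 are supported in this sketch")


def mixture_p_value(lrt, weights=(0.5, 0.2, 0.3)):
    """P-value under a weighted mixture of chi-squared(0), (1), and (2)."""
    return sum(w * chi2_sf(lrt, df) for df, w in enumerate(weights))


# Test statistics copied from Listing 3.
for branch, lrt in [("Separator", 14.13), ("0564_4", 4.81), ("0564_3", 14.06)]:
    print(f"{branch}: LRT = {lrt:.2f}, p = {mixture_p_value(lrt):.5f}")
```

Small discrepancies in the last digits are expected because the listing prints the test statistics rounded to two decimals. A statistic of zero gives a P-value of 1 under this mixture, consistent with the rows of the table where the null and alternative fits coincide.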

Finally, aBSREL reports three branches under episodic
diversifying selection pressure. Further examination of results using
HyPhy-Vision shows that these branches are found (a) along the
transmission event from donor to recipient, and (b) within a highly
diverged clade in the donor (Fig. 2). The first finding is consistent
with an expected increase in evolutionary rate when a virus infects a
new host and encounters novel host immunity, and the second
finding is consistent with intrahost adaptive dynamics of the
donor’s long-term HIV infection. Importantly, a close examination
of the markdown-output table under the header "Testing
selected branches for selection" reveals several nodes
with uncorrected P-values whose significance was lost upon
applying the Bonferroni-Holm correction, e.g., 0564_4, whose
uncorrected P = 0.03281. This result illustrates the potential loss of
power incurred by this aBSREL exploratory analysis.
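
The effect of the correction can be illustrated with a short Holm-Bonferroni sketch. Several inputs are assumptions on our part: the listing is abbreviated, so Node35's uncorrected P-value is back-calculated from its corrected value (0.00018 / 44), and 0564_4 is treated as the fourth smallest P-value among the 44 tests. The function `holm_adjust` is a hypothetical helper, not part of HyPhy.

```python
def holm_adjust(p_values, n_tests=None):
    """Holm-Bonferroni adjusted P-values, returned in the input order.

    n_tests states the total number of tests when only the smallest
    P-values are supplied (all omitted ones must be larger).
    """
    m = n_tests if n_tests is not None else len(p_values)
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    adjusted = [0.0] * len(p_values)
    running_max = 0.0
    for rank, index in enumerate(order):
        # The smallest P-value is multiplied by m, the next by m - 1, ...
        scaled = min(1.0, (m - rank) * p_values[index])
        running_max = max(running_max, scaled)  # keep adjusted values monotone
        adjusted[index] = running_max
    return adjusted


branches = ["Node35", "Separator", "0564_3", "0564_4"]
uncorrected = [0.00018 / 44, 0.00029, 0.00030, 0.03281]  # first value is assumed
corrected = holm_adjust(uncorrected, n_tests=44)
for name, p in zip(branches, corrected):
    print(f"{name}: corrected p = {p:.5f}")
significant = [n for n, p in zip(branches, corrected) if p <= 0.05]
```

Up to rounding of the printed inputs, the corrected values for Separator and 0564_3 should be close to those reported (0.01251 and 0.01266). Branch 0564_4 is multiplied by a factor of at least 2 for any rank below 44, so it cannot remain below 0.05 after correction.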


Rules of Thumb for aBSREL Use 


1. A priori identification of branches to test for selection will
generally increase power to detect selection on those branches.
That said, to maintain statistical robustness, we strongly
discourage performing multiple separate tests for selection on different
branch sets. Such an approach inflates the false-positive
rate. In such a case, we recommend performing an
exploratory analysis wherein all branches are considered.


2. Exploratory analyses of very large datasets are unlikely to yield
many significant results, because correcting for multiple testing
will reduce power as the number of branches grows, while the
amount of statistical signal does not increase for larger datasets.
One option is to thin out large phylogenies (before performing
any testing), retaining major clades and lineages of interest.


Fig. 2 HyPhy-Vision tree viewer depicting the fitted aBSREL Adaptive model to HIV-1 data. Branches are
colored by their inferred ω distribution, as indicated in the legend. Lineages identified as subject to positive
selection at P < 0.05 after correction for multiple testing are shown with thick branches, with color distributions
representing the relative values and proportions of inferred ω categories. Note that taxon labels beginning
with “0564” represent HIV-1 sequences derived from the donor patient, and labels beginning with “0557”
represent HIV-1 sequences derived from the recipient patient


3.5 Site-Level Selection: MEME, FEL, SLAC, and FUBAR


What Biological Question Is the Method Designed to 
Answer?: 


The methods FEL, SLAC, and FUBAR address the question:
Which site(s) in a gene are subject to pervasive, i.e., consistently
across the entire phylogeny, diversifying selection? MEME
addresses a more general question: Which site(s) in a gene are
subject to pervasive or episodic, i.e., only on a single lineage or
subset of lineages, diversifying selection?


Recommended Applications 


1. MEME is the sole method in HyPhy for detecting selection at
individual sites that considers both pervasive and episodic
selection. MEME is therefore our recommended method if
maximum power is desired.


2. The phenomenon of pervasive selection is generally most
prevalent in pathogen evolution and any biological system
influenced by evolutionary arms race dynamics (or balancing
selection), including adaptive immune escape by viruses. As
such, FEL, SLAC, and FUBAR are ideally suited to identify
sites under positive selection which represent candidate sites
subject to strong selective pressures across the entire
phylogeny. Each of these methods has a particular use case as well:


• FEL is our recommended method for analyzing small- to
medium-sized datasets when one wishes only to study
pervasive selection at individual sites.


• FUBAR is our recommended method for detecting
pervasive selection at individual sites on large (> 500 sequences)
datasets for which other methods have prohibitive runtimes,
unless you have access to a computer cluster.


• SLAC provides legacy functionality as a counting-based
method adapted for phylogenetic applications. In general,
this method will be the least statistically robust.


Statistical Test Procedure: 

Each method presented here employs a distinct algorithmic
approach to inferring selection. FEL uses maximum likelihood
to fit a codon model to each site, thereby estimating a value for
dN and dS at each site. FEL tests for selection with the likelihood
ratio test using the $\chi^2_1$ distribution, asking whether the dN
estimate is significantly greater than the inferred dS estimate.

SLAC represents the most basic inference method and is an
extension of the Suzuki-Gojobori counting-based method [39]
for phylogenetically related sequences (as opposed to sequence
pairs). SLAC uses maximum likelihood to infer ancestral
characters for each site across the phylogeny and then directly counts
the number of synonymous and non-synonymous changes which
have occurred at each site over evolutionary time. SLAC then
tests for selection by testing whether or not there are too many or
too few non-synonymous changes compared to what is expected
under neutrality. The neutral expectation is derived based on
the phylogeny-wide estimated numbers of synonymous and
non-synonymous nucleotide sites at a given codon. The statistical
test employs the binomial distribution to compute significance,
e.g., how likely is it to observe 13 non-synonymous and 1
synonymous substitutions at a site, if the expected synonymous to
non-synonymous substitution count ratio under neutrality is 1:4?
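
The binomial question posed above can be answered directly. The sketch below assumes a one-sided upper-tail test (the probability of observing at least 13 non-synonymous changes out of 14), which is our simplification; SLAC's actual test may differ in detail. The function name `binomial_upper_tail` is hypothetical.

```python
from math import comb


def binomial_upper_tail(successes, trials, p_success):
    """P(X >= successes) for X ~ Binomial(trials, p_success)."""
    return sum(
        comb(trials, k) * p_success**k * (1.0 - p_success) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# A 1:4 expected synonymous:non-synonymous ratio means that, under
# neutrality, each change is non-synonymous with probability 4/5.
p_nonsyn = 4 / (1 + 4)
p_value = binomial_upper_tail(13, 14, p_nonsyn)
print(f"P(>= 13 of 14 changes are non-synonymous) = {p_value:.4f}")
```

Under these assumptions the probability is roughly 0.2, so even 13 non-synonymous changes out of 14 would not be strong evidence for positive selection when non-synonymous changes are already expected to be four times as common.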

MEME employs a mixed-effects maximum likelihood
approach. For each site, MEME infers two ω rate classes and
corresponding weights representing the probability that the site
evolves under each rate class at a given branch. To this end,
MEME infers a single $\alpha$ (dS) parameter and two separate $\beta$
(dN) parameters, $\beta^-$ and $\beta^+$. The ω rates per site, therefore,
consist of $\beta^+/\alpha$ and $\beta^-/\alpha$. MEME uses this framework to fit a
null and alternative model each, both models enforcing the
constraint $\beta^- \le \alpha$. The null model disallows positive selection by
enforcing the constraint $\beta^+ \le \alpha$, whereas the alternative model
places no constraint on $\beta^+$. MEME uses the likelihood ratio test to
compare between null and alternative model fits, with significance
assessed using the mixture of 33% $\chi^2_0$, 30% $\chi^2_1$, and 37% $\chi^2_2$.

FUBAR takes a Bayesian approach to selection inference
and is a particular case of statistical models developed in the
context of document classification (latent Dirichlet allocation).
The key innovation of FUBAR’s approach is its use of an a priori
specified grid of dN and dS values (typically 20 × 20), spanning
the range of negative, neutral, and positive selection
regimes, whose likelihoods can be pre-computed and used
throughout analysis (rather than having to re-compute
likelihoods during optimization as traditional random-effects
approaches do [12, 29]). This approach, combined with other
algorithmic advances, speeds computation time by at least an
order of magnitude compared to FEL, while yielding comparable
statistical performance. FUBAR estimates every model
parameter except the proportion of sites allocated to each grid
point using simple (and fast) nucleotide models. The proportions
are estimated using an MCMC procedure, and non-neutral
evolution at each site is inferred using a straightforward naive
empirical Bayes approach [29]. Sites are called positively or
negatively selected if the corresponding posterior probabilities
are sufficiently high.
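
The final empirical Bayes step can be sketched as follows. This is a toy illustration under stated assumptions rather than FUBAR's implementation: the three-point grid, the grid weights, and the per-site likelihoods are all invented for the example (a real analysis would use a much larger grid and weights estimated by MCMC), and the function name `posterior_positive` is hypothetical.

```python
def posterior_positive(grid, weights, site_likelihoods):
    """Posterior probability that dN > dS at one site (naive empirical Bayes).

    grid             -- list of (dS, dN) pairs
    weights          -- estimated proportion of sites at each grid point
    site_likelihoods -- pre-computed likelihood of the site at each grid point
    """
    # Unnormalized posterior mass at each grid point: prior weight x likelihood.
    joint = [w * lik for w, lik in zip(weights, site_likelihoods)]
    total = sum(joint)
    positive = sum(j for (ds, dn), j in zip(grid, joint) if dn > ds)
    return positive / total


# Hypothetical values: negative, neutral, and positive selection grid points.
grid = [(1.0, 0.1), (1.0, 1.0), (1.0, 3.0)]
weights = [0.90, 0.08, 0.02]
site_likelihoods = [1e-6, 2e-5, 5e-4]

posterior = posterior_positive(grid, weights, site_likelihoods)
print(f"posterior probability of dN > dS = {posterior:.2f}")
```

Because the grid likelihoods are computed once and reused, this step reduces to a weighted sum per site. A site would then be called positively selected if this posterior probability exceeds a chosen threshold.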


Note that FEL and SLAC report both positively and negatively
selected sites, but MEME and FUBAR report only sites under
positive selection.


Example Analysis: 


We will demonstrate the use and interpretation of site-level
methods using data from influenza strain H3N2 (the “Hong
Kong flu”), the primary circulating strain of seasonal influenza
since the late 1960s. We specifically will assess selection on
the H3 hemagglutinin, the influenza surface protein which is
responsible for host cell binding. Hemagglutinin experiences
rapid evolution triggered by host immune escape, and previous
studies have identified numerous signatures of positive
diversifying selection in H3 sequences with a particular concentration
around the host-binding domain [28].


We base analyses here on an alignment from Meyer and Wilke
[22] of H3 sequences sampled over time since the 1991-1992
influenza season. We removed all partial and strongly outlying
sequences (i.e., those with excessive divergence) from the original
dataset before proceeding, yielding 2555 sequences to comprise
our “full” H3 dataset. We further subset this alignment into two
smaller alignments with comparable numbers of taxa but spanning
different evolutionary time frames: the first smaller alignment
(“trunk”) contains 163 sequences sampled along the influenza
H3 trunk, whereas the second smaller alignment (“shallow”)
contains 121 sequences sampled from a single clade (Fig. 3).
Therefore, while these two smaller datasets contain a comparable number
of sequences, the trunk dataset spans a much longer time frame and
contains substantially more sequence divergence than the
shallow dataset. Indeed, the trunk dataset has a total tree length
(sum of branch lengths, in units of substitutions/site/unit time) of
0.43, whereas the shallow dataset has a total tree length of 0.12,
meaning that the trunk dataset contains nearly four times the
amount of sequence divergence seen in the shallow dataset. We
have compiled results for all three datasets analyzed with all four
methods (Table 1). We now describe, using the trunk dataset as an
example, how to run each of these analyses in HyPhy.
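The total tree length quoted above is simply the sum of all branch lengths in the tree. As a minimal sketch, assuming the tree is available as a plain Newick string with branch lengths after each colon (the function name and the toy tree are hypothetical and are not taken from the H3 data), the sum can be computed as follows:

```python
import re


def tree_length(newick: str) -> float:
    """Sum every branch length (the number after each ':') in a Newick string."""
    # Branch lengths follow a colon; allow decimals and scientific notation.
    # Assumption: taxon labels do not themselves contain colons.
    pattern = r":\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)"
    return sum(float(x) for x in re.findall(pattern, newick))


# Hypothetical four-taxon tree used only for illustration.
toy_tree = "((A:0.10,B:0.05):0.02,(C:0.03,D:0.04):0.01);"
print(round(tree_length(toy_tree), 2))

# Ratio of the two tree lengths reported in the text (trunk vs. shallow).
print(round(0.43 / 0.12, 2))
```

The second printed value is the ratio behind the "nearly four times" statement in the text.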


Fig. 3 Phylogeny of H3 hemagglutinin sequences analyzed here. Tip colors indicate those selected for each
dataset (legend entries: Full, Shallow, Trunk, Shallow and trunk)


Table 1
Sites identified as positively selected across the H3 datasets analyzed here

Dataset      Method       Sites under selection at P < 0.1*

Full H3      MEME (16)    19, 47, 61, 69, 110, 151, 154, 156, 173, 208, 236, 241, 277, 278, 292, 538
Full H3      FEL (15)     19, 47, 61, 69, 110, 154, 156, 173, 236, 237, 241, 277, 278, 292, 538
Full H3      SLAC (19)    19, 47, 61, 69, 110, 137, 154, 156, 158, 173, 189, 208, 236, 237, 241, 277, 278,
                          292, 505, 546
Full H3      FUBAR (13)   47, 61, 69, 110, 154, 160, 173, 208, 236, 237, 241, 278, 538
Shallow H3   MEME (2)     49, 320
Shallow H3   FEL (2)      49, 241
Shallow H3   SLAC         None
Shallow H3   FUBAR (3)    19, 49, 241
Trunk H3     MEME (6)     64, 154, 171, 208, 242, 402
Trunk H3     FEL (3)      64, 154, 208
Trunk H3     SLAC (2)     154, 208
Trunk H3     FUBAR (6)    61, 64, 69, 154, 208, 242

Bold sites are those identified by multiple methods for a given dataset. Bold italicized sites are those identified in more
than one dataset, generally by more than one method. Numbers in parentheses give the total number of positively
selected sites identified with the given method and dataset

* For FUBAR, significance is assessed as posterior probability > 0.9
FEL: Launch HyPhy from the command line, and enter
options 1 (Selection Analyses) and then 2 to reach the FEL analysis
menu, and supply values for the following prompts:

1. Choose genetic code. Enter 1 to use the Universal genetic code.

2. Select a coding sequence alignment file. Provide the full path
to the dataset of interest: /path/to/data/h3_trunk.fna.

3. A tree was found in the data file... Would you like to use it
(y/n)?. Enter “y” to use the tree.

4. Choose the set of branches to test for selection. This option
allows you to specify the branches along which site-level
inference should be performed. Enter 1 to test all branches
for selection.

5. Use synonymous rate variation?. This option asks you to specify
whether the dS parameter in the codon model should be
allowed to vary across sites (“Yes”) or be fixed to 1 at all sites
(“No”). Enter 1 to use a model with synonymous rate variation.

6. Select the P-value used to perform the test at (permissible
range = [0,1], default value = 0.1). Provide the default
threshold of 0.1.
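For repeated runs it can be convenient to script the menu answers rather than typing them. The sketch below only assembles the answers from the six prompts above into one string; the helper names are hypothetical. Whether a given HyPhy build reads menu answers from standard input is an assumption that should be checked locally, and the executable path is deliberately left for the caller to supply.

```python
import subprocess


def fel_answers(alignment_path: str, p_value: float = 0.1) -> str:
    """Return the newline-separated menu answers for the FEL walkthrough."""
    answers = [
        "1",             # Selection Analyses
        "2",             # FEL
        "1",             # Universal genetic code
        alignment_path,  # coding sequence alignment file
        "y",             # use the tree found in the data file
        "1",             # test all branches
        "1",             # use synonymous rate variation
        str(p_value),    # significance threshold
    ]
    return "\n".join(answers) + "\n"


def run_hyphy(executable: str, answers: str) -> str:
    """Pipe the answers to a HyPhy executable and return its screen output.

    Assumption: the executable accepts its menu answers on standard input.
    """
    result = subprocess.run([executable], input=answers, text=True,
                            capture_output=True, check=True)
    return result.stdout


print(fel_answers("/path/to/data/h3_trunk.fna"))
```

Only `fel_answers` is exercised here; `run_hyphy` is included to show where the string would be used once the local executable is known.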


FEL will now run to completion and print status indicators to 
the screen, including results for any site found to be under selection 
(either positive or negative). Abbreviated results are shown below. 


Listing 4 Partial FEL screen output:

### Obtaining branch lengths and nucleotide rates under the GTR model
* Log(L) = -7506.06

### Obtaining the global omega estimate based on relative GTR branch lengths
and nucleotide substitution biases
* Log(L) = -7302.10
* non-synonymous/synonymous rate ratio for *test* = 0.2923

### Improving branch lengths, nucleotide substitution biases, and global dN/dS
ratios under a full codon model
* Log(L) = -7289.65
* non-synonymous/synonymous rate ratio = 0.2598

### For partition 1 these sites are significant at p <=0.1

| Codon | Partition | alpha | beta  | LRT   | Selection detected? |
| 146   | 1         | 3.818 | 0.000 | 7.336 | Neg. p = 0.0068     |
| 152   | 1         | 1.968 | 0.000 | 3.634 | Neg. p = 0.0566     |
| 154   | 1         | 0.000 | 3.912 | 4.652 | Pos. p = 0.0310     |
| 159   | 1         | 4.413 | 0.716 | 2.972 | Neg. p = 0.0847     |
| ...   | 1         | 2.082 | 0.000 | 2.713 | Neg. p = 0.0995     |
| ...   | 1         | 1.659 | 0.000 | 2.986 | Neg. p = 0.0840     |
| ...   | 1         | 6.393 | 0.000 | 8.421 | Neg. p = 0.0037     |
| ...   | 1         | 1.928 | 0.000 | 3.286 | Neg. p = 0.0699     |
| ...   | 1         | 2.085 | 0.000 | 2.715 | Neg. p = 0.0994     |
| ...   | 1         | 1.645 | 0.000 | 3.370 | Neg. p = 0.0664     |
| ...   | 1         | 0.000 | 3.625 | 4.668 | Pos. p = 0.0307     |

### ** Found _3_ sites under pervasive positive diversifying and _115_
sites under negative selection at p <= 0.1**


Inference details for codons with significant likelihood ratio
tests for positive or negative selection are reported to the screen.

Codon                 The codon where non-neutral evolution has
                      been detected.
Partition             Allows one to keep track of which subset of the
                      alignment a particular site belongs to. This is
                      important for recombination-corrected partition
                      analyses.
alpha                 Site-specific synonymous substitution rate
beta                  Site-specific non-synonymous substitution rate
LRT                   Site-specific likelihood ratio test statistic for
                      non-neutral evolution (alpha ≠ beta)
Selection detected?   Selection classification (positive or negative)
                      and the corresponding P-value

Note that the “Codon” and “Partition” columns are common
to all site-specific analyses.
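The relationship between the LRT column and the reported P-value can be checked by hand. The sketch below assumes that the LRT statistic is compared against a chi-squared distribution with one degree of freedom; this assumption is made here because it reproduces the P-values printed in Listing 4, and the function name is hypothetical.

```python
import math


def fel_p_value(lrt: float) -> float:
    """Upper-tail probability of a chi-squared(1) variable at the observed LRT."""
    # For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(lrt / 2.0))


# LRT values copied from three rows of Listing 4.
for lrt in (7.336, 4.652, 8.421):
    print(f"LRT = {lrt:.3f} -> p = {fel_p_value(lrt):.4f}")
```

The printed values should land close to the 0.0068, 0.0310, and 0.0037 shown for the corresponding rows of Listing 4.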


MEME and SLAC: SLAC and MEME follow the same menu
prompts as FEL, with the exception that only FEL will prompt
for synonymous rate variation. Instead, SLAC has a different
prompt for Step 5: Select the number of samples used to assess
ancestral reconstruction uncertainty. If this number is positive,
then HyPhy will draw samples from the distribution of ancestral
states and use them to measure whether or not inference is sensitive
to ancestral inference uncertainty. When you encounter this option,
provide the default value of 100 (or 0 to forgo sampling). MEME
does not emit any additional prompts.
Listing 5 Partial SLAC screen output:

### For partition 1 these sites are significant at p <=0.1

| Codon | Partition | S     | N     | dS    | dN    | Selection detected? |
| 146   | 1         | 3.000 | 0.000 | 3.000 | 0.000 | Neg. p = 0.037      |
| 154   | 1         | 0.000 | 8.000 | 0.000 | 4.000 | Pos. p = 0.039      |
| 177   | 1         | 3.000 | 0.000 | 4.038 | 0.000 | Neg. p = 0.020      |
| 208   | 1         | 0.000 | 6.000 | 0.000 | 2.994 | Pos. p = 0.089      |

### Ancestor sampling analysis
> Generating 100 ancestral sequence samples to obtain confidence intervals

Resampling results for partition 1

| Codon | Part. | S [median, IQR]  | N [median, IQR]  | dS [median, IQR] | dN [median, IQR] | p-value [median, IQR] |
| 146   | 1     | 3.00 [3.00-3.00] | 0.00 [0.00-0.00] | 3.00 [3.00-3.00] | 0.00 [0.00-0.00] | 0.04 [0.04-0.04]      |
| 154   | 1     | 0.00 [0.00-0.00] | 8.00 [8.00-8.00] | 0.00 [0.00-0.00] | 4.00 [4.00-4.00] | 0.04 [0.04-0.04]      |
| 177   | 1     | 3.00 [3.00-4.00] | 0.00 [0.00-0.00] | 4.04 [4.04-5.38] | 0.00 [0.00-0.00] | 0.02 [0.01-0.02]      |
| 208   | 1     | 0.00 [0.00-0.00] | 6.00 [6.00-6.00] | 0.00 [0.00-0.00] | 2.99 [2.99-2.99] | 0.09 [0.09-0.09]      |


SLAC reports several key quantities for codons with significant
P-values for positive or negative selection to the screen.

S                     The number of synonymous substitutions
                      inferred at this site
N                     The number of non-synonymous substitutions
                      inferred at this site
dS                    Estimated site-specific synonymous rate
dN                    Estimated site-specific non-synonymous rate
Selection detected?   Selection classification (positive or negative)
                      and the corresponding P-value for the binomial
                      test

If the user elected to perform ancestral resampling, another
table is reported, showing how much these quantities are affected
by ancestral state reconstruction uncertainty. For example, at codon
177, some ancestral reconstructions yielded 3 synonymous substitutions,
whereas others yielded 4; however, this was not sufficient
to move the P-value across the significance threshold.
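To make the binomial test mentioned above concrete, the following sketch computes one-sided binomial tail probabilities from the S and N counts. It is an illustration and not SLAC's own code: the function names are hypothetical, and the expected share of synonymous changes under neutrality (`p_syn`) is set to 1/3 purely as an assumed value for illustration, whereas SLAC derives this quantity from the data.

```python
from math import comb


def binomial_upper_tail(successes: int, trials: int, prob: float) -> float:
    """P(X >= successes) for X ~ Binomial(trials, prob)."""
    return sum(comb(trials, k) * prob**k * (1.0 - prob)**(trials - k)
               for k in range(successes, trials + 1))


def slac_style_p_values(s: int, n: int, p_syn: float) -> tuple[float, float]:
    """Return (p_positive, p_negative) for S synonymous and N non-synonymous changes.

    p_syn is the expected share of synonymous changes under neutrality
    (an input here; assumed, not estimated).
    """
    total = s + n
    p_positive = binomial_upper_tail(n, total, 1.0 - p_syn)  # excess of N
    p_negative = binomial_upper_tail(s, total, p_syn)        # excess of S
    return p_positive, p_negative


# Counts for codons 154 and 146 from Listing 5; p_syn = 1/3 is an assumption.
print(round(slac_style_p_values(0, 8, 1 / 3)[0], 3))
print(round(slac_style_p_values(3, 0, 1 / 3)[1], 3))
```

Under this assumed `p_syn`, the two printed values fall close to the P-values listed for codons 154 and 146 in Listing 5.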
Listing 6 Partial MEME screen output:

| Codon | Partition | alpha | beta+  | p+    | LRT   | Episodic selection detected? | # branches |
| 64    | 1         | 0.000 | 14.717 | 0.204 | 3.512 | Yes, p = 0.0816              | 5          |
| 154   | 1         | 0.000 | 35.302 | 0.145 | 5.334 | Yes, p = 0.0317              | 8          |
| 171   | 1         | 0.000 | 45.005 | 0.017 | 5.753 | Yes, p = 0.0256              | 1          |
| 208   | 1         | 0.000 | 59.749 | 0.089 | 5.554 | Yes, p = 0.0283              | 6          |
| 242   | 1         | 1.839 | 34.114 | 0.216 | 4.273 | Yes, p = 0.0549              | ...        |
| 402   | 1         | 0.000 | 10.476 | 0.091 | 3.493 | Yes, p = 0.0824              | 2          |

### ** Found _6_ sites under episodic diversifying positive selection at p <= 0.1**


MEME prints information only about codons subject to positive
selection, since MEME does not directly test for negative
selection.

alpha                          Site-specific synonymous substitution
                               rate
beta+                          Site-specific non-synonymous substitution
                               rate for the positive selection
                               category
p+                             Site-specific weight (~ proportion of
                               branches) assigned for the positive
                               selection category
LRT                            Site-specific likelihood ratio test statistic
                               for episodic diversifying selection
                               (beta+ > 1 and p+ > 0)
Episodic selection detected?   Selection classification (yes) and the
                               corresponding P-value
# branches                     An exploratory estimate of the number
                               of individual branches which have
                               sufficient empirical Bayes support for
                               positive selection; since MEME pools
                               signal from multiple branches, there
                               may be overall evidence for selection,
                               without necessarily implicating any
                               individual branches.


FUBAR: To run FUBAR, launch HyPhy from the command line,
and enter options 1 (Selection Analyses) and then 4 to reach the
FUBAR analysis menu, and supply values for the following
prompts (see footnote 5):

1. Choose genetic code. Enter 1 to use the Universal
genetic code.

2. Select a coding sequence alignment file. Provide the full path
to the dataset of interest: /path/to/data/h3_trunk.fna.

3. A tree was found in the data file... Would you like to use it
(y/n)?. Enter “y” to use the tree.

4. Number of grid points per dimension. This option controls
how fine the FUBAR analysis is by setting the range of possible
dN and dS values that can be inferred, along an N x N grid.
We will use the default value of 20 (leading to a 20 x 20 grid of
dN/dS ratios). FUBAR will now pre-compute likelihoods for
each value in the grid.

5. Number of MCMC chains to run. This option determines
the number of Markov Chain Monte Carlo chains to run
during Bayesian inference of evolutionary rates. Enter the
default value of 5 to run 5 chains.

6. The length of each chain. This option controls for how long
each MCMC chain should be run. Enter the default value of
2000000 to run each chain for two million generations (thus
obtaining two million samples).

7. Use this many samples as burn-in. This option determines
how many initial samples drawn from the MCMC chain should
be discarded as burn-in, as is standard in Bayesian analyses.
Enter the default value of 1000000, leading to a final value of
one million draws per chain.

8. How many samples should be drawn from each chain. This
option determines the final number of samples to draw from
the full set of one million draws per chain. Enter the default
value of 100.

9. The concentration parameter of the Dirichlet prior. This
option controls the shape of the Dirichlet prior distribution.
Enter the default value of 0.5.

5 Note that for all prompts with default values, simply pressing enter will choose this default.
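The default values quoted in the prompts above imply a few derived quantities that are worth keeping in view when changing settings. The sketch below only does this bookkeeping; the class and method names are hypothetical, and the thinning interval assumes that the retained draws are sampled at evenly spaced intervals, which the text does not state.

```python
from dataclasses import dataclass


@dataclass
class FubarSettings:
    grid_points: int = 20          # grid points per dimension
    chains: int = 5                # number of MCMC chains
    chain_length: int = 2_000_000  # length of each chain
    burn_in: int = 1_000_000       # samples discarded per chain
    samples_per_chain: int = 100   # samples drawn from each chain
    concentration: float = 0.5     # Dirichlet prior concentration

    def grid_cells(self) -> int:
        """Number of (dS, dN) combinations on the N x N grid."""
        return self.grid_points ** 2

    def retained_draws(self) -> int:
        """Draws left per chain once the burn-in is discarded."""
        return self.chain_length - self.burn_in

    def thinning_interval(self) -> int:
        """Spacing between kept samples, assuming evenly spaced sampling."""
        return self.retained_draws() // self.samples_per_chain

    def total_samples(self) -> int:
        """Samples kept across all chains."""
        return self.chains * self.samples_per_chain


s = FubarSettings()
print(s.grid_cells(), s.retained_draws(), s.thinning_interval(), s.total_samples())
```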


Listing 7 Partial FUBAR screen output:

### Tabulating site-level results

| Codon | Partition | alpha | beta   | N.eff    | Posterior prob for positive selection |
| 61    | 1         | 0.753 | 4.365  | 64.549   | Pos. posterior = 0.9262               |
| 64    | 1         | 0.753 | 3.920  | 77.106   | Pos. posterior = 0.9095               |
| 69    | 1         | 0.730 | 4.447  | 64.182   | Pos. posterior = 0.9325               |
| 154   | 1         | 0.637 | 6.595  | 53.312   | Pos. posterior = 0.9826               |
| 208   | 1         | 0.622 | 5.908  | 55.794   | Pos. posterior = 0.9731               |
| 242   | 1         | 2.215 | 12.055 | 1489.879 | Pos. posterior = 0.9131               |

## FUBAR inferred 6 sites subject to diversifying positive selection at
posterior probability >= 0.9
Of these, 0.36 are expected to be false positives (95% confidence interval
of 0-2)


Like other site analyses, FUBAR will print a number of inferences
about each individual site detected to be under pervasive
positive selection:

alpha                    The posterior estimate of the synonymous
                         substitution rate at a site
beta                     The posterior estimate of the
                         non-synonymous substitution rate at a site
N.eff                    An estimate of the effective sample size for
                         inferring positive selection at this site; smaller
                         values (e.g., < 20) imply that the MCMC
                         procedure may have failed to sample the
                         parameter space well, and longer chains
                         (or more chains) might be warranted
Posterior prob           The estimated posterior probability for pervasive
for positive selection   diversifying selection (dN/dS > 1).
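The "expected to be false positives" figure in Listing 7 can be related to the posterior probabilities themselves. A common Bayesian calculation, assumed here rather than confirmed as FUBAR's exact procedure, sums one minus the posterior probability over the reported sites; the function name is hypothetical.

```python
def expected_false_positives(posterior_probs):
    """Sum of (1 - posterior) over all sites called positively selected."""
    return sum(1.0 - p for p in posterior_probs)


# Posterior probabilities of the six sites reported in Listing 7.
posteriors = [0.9262, 0.9095, 0.9325, 0.9826, 0.9731, 0.9131]
print(round(expected_false_positives(posteriors), 2))
```

For these six values the sum is 0.363, in line with the 0.36 reported on screen.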


Interpreting Results: 

Sites identified as positively selected by each method, across all
three datasets, are given in Table 1. In general, we expect
MEME to be the most comprehensive and robust of all site-level
methods because it uniquely considers both pervasive and episodic
selection [24]. In addition, power studies have shown that
FUBAR is expected to outperform FEL and SLAC under most
circumstances [25]. Finally, we expect that SLAC will be the
least robust method due to its reliance on a relatively naive
counting-based approach [12].

These expectations are generally borne out in the results
obtained here in our brief study of H3 selection. For the full H3
dataset of 2555 sequences, MEME identified 16 sites, and FEL
identified 15 sites under positive selection. All sites were identical
except for the following: MEME uniquely identified sites 151 and
208, and FEL uniquely identified site 237. Interestingly, site
208 was additionally identified as positively selected by all methods
on the trunk H3 dataset. Combined, these results demonstrate
MEME’s ability to identify sites subject to both pervasive and
episodic selection, as site 208 appears to be under pervasive selection
only along the H3 trunk. Because FEL uses a less stringent test
statistic distribution (χ²₁) to call significance, occasionally sites
subject to pervasive selection near the significance thresholds may
be detected by FEL but missed by MEME (e.g., site 237, with FEL
reporting P = 0.08 and MEME reporting P = 0.105).

FUBAR identified two fewer selected sites in the full H3 alignment
compared to FEL (which is a directly comparable test),
missing sites 19 (posterior 0.83), 277 (posterior 0.59), and
292 (posterior 0.89) relative to FEL, but adding site 160 (FEL
P = 0.8).

In addition to differences across methods, we expect to see
some important differences for sites inferred across the full, shallow,
and trunk H3 datasets. Because the trunk and full H3 datasets span
similar time frames, we expect sites returned for these two datasets
to have the most overlap. In addition, sites found to be under
selection in the shallow lineage may not be detected across the
full H3 phylogeny, as selection may have been fleeting, weak, or
constrained to the specific shallow clade examined here. For example,
site 49 was specifically selected in the shallow H3 lineage alone,
as indicated by three of the four methods. In contrast, sites 19 and
241 were found to be selected in both the shallow and the full H3
datasets, but this signal was not apparent when the trunk lineage
was examined independently, perhaps because these sites experience
only transient changes that do not propagate along the trunk.

What are some potential reasons for seeing discrepancies in
inferences across H3 datasets? Site 154, for example, is positively
selected in both the full H3 phylogeny and the trunk H3
lineage, but not the shallow H3 lineage. This result suggests that
site 154 may have experienced pervasive selection throughout H3
evolution, but its signal in the shallow clade alone was either too
weak to detect or selection was attenuated in the shallow clade. In
addition, sites which appeared only in the shallow clade analyses
may have experienced lineage-specific selection where the signal
was too weak to detect when the entire phylogeny was considered.

Furthermore, while MEME, FEL, and FUBAR were able to
detect selected sites in the shallow H3 lineage, SLAC did not
identify any such sites. This is because SLAC requires a large
number of substitutions, which are unlikely to have occurred in
the shallow sample, to achieve significance. Overall, we emphasize
that in many cases different site-level methods will not identify
exactly the same set of sites under selection, although, as the H3
example shows, the agreement between methods is typically good.
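Agreement between methods can be summarized with simple set operations. The sketch below uses the trunk H3 calls from Table 1; the variable names are hypothetical, and the choice of intersection, union, and per-site counts is only one way to tabulate overlap.

```python
from collections import Counter

# Trunk H3 sites called positively selected by each method (Table 1).
trunk_calls = {
    "MEME": {64, 154, 171, 208, 242, 402},
    "FEL": {64, 154, 208},
    "SLAC": {154, 208},
    "FUBAR": {61, 64, 69, 154, 208, 242},
}

# Sites on which every method agrees, and sites called by at least one method.
consensus = set.intersection(*trunk_calls.values())
union = set.union(*trunk_calls.values())

# Number of methods supporting each site.
support = Counter(site for sites in trunk_calls.values() for site in sites)

print("all methods:", sorted(consensus))
print("any method: ", sorted(union))
print("support:    ", dict(sorted(support.items())))
```

The consensus set is limited by the least sensitive method, while the union is the most inclusive list.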
Rules of Thumb for Site-Level Detection of Selection

1. Small datasets, i.e., < 10 sequences (especially when coupled
with low divergence), are unlikely to yield any sites under
selection. Consider using gene-wide methods like BUSTED
or aBSREL to look for selection in these cases.

2. On large datasets (e.g., > 500 sequences), all methods tend to
give similar results (but see the MEME exception below),
hence the default method of choice is FUBAR, since its run
time is dramatically shorter than FEL or MEME, and its statistical
performance is better than SLAC.

3. MEME tends to be the most sensitive method, because it is the
only one designed to detect episodic selection. Indeed, sometimes
SLAC, FEL, or FUBAR may all call a site that is subject to
episodic positive selection negatively selected, if a burst of
selection is followed by strong conservation. MEME is often
able to tease the two processes apart and correctly call such sites
positively selected. Hence, MEME should be the preferred
method, unless computationally prohibitive.

4. We cannot universally recommend running all the available
methods on a given dataset and then aggregating the results,
as done in Table 1, for several reasons. Firstly, while it may be
tempting to use agreement between all methods as a hedge
against false positives, i.e., calling a site selected only if all the
methods agreed on it, doing so reduces the power of the analysis
to that of the least sensitive method. Secondly, while comparing the
sites on which methods disagree can potentially reveal critical
information (e.g., a site detected by MEME but not FUBAR
may be under strong episodic selection), considerable effort
and diligence must be put into disentangling meaningful
biological differences from statistical artifacts. Thirdly, statistical
strategy must be informed before the analysis commences
by deciding which is more important to optimize: does one
care more about specificity (reducing false positives) or sensitivity
(reducing false negatives)? For example, if little is known
about a gene, it may be advisable to generate the most inclusive
list of sites that could be subject to selection for subsequent
testing using other approaches; in this case, the most sensitive
method or the union of all methods may be appropriate.


5. We strongly recommend against performing multiple testing or
false discovery rate correction on individual site results. Firstly,
methods are calibrated to not generate excessive false positives
on strictly neutral data. In most genes, most sites will be under
relatively strong negative selection, making the statistical testing
procedure conservative. Secondly, multiple testing
corrections will nearly always yield no significant results on
small to moderate sized datasets. Thirdly, some key assumptions
of methods for correcting false discovery rates are not
applicable for site-level testing. For example, a typical collection
of results from site-level testing will contain very few, if
any, true sites with P-values supporting neutrality (dN/dS = 1).

3.6 Screening Sequences for Recombination

A critical aspect of sequence analysis we have not yet covered is the
detection of and correction for intragenic recombination in an
alignment of homologous sequences. Because recombination is
such a key biological process in many viral pathogens, we strongly
advocate screening an alignment for recombination before proceeding
with additional analyses, unless there is a sound biological
reason to discount it (e.g., intragenic recombination in Influenza A is
negligibly rare). Indeed, because recombination causes different
regions of an alignment to be related by different phylogenies, its
presence can heavily influence selection detection and other downstream
applications.

There are many computational approaches to finding evidence
of recombination in a sequence alignment [32]; however, at their
core, many such methods look for evidence of phylogenetic incongruence.
Here, we demonstrate one such method, GARD (genetic
algorithms for recombination detection), that we have found to
perform very well among a wide range of approaches on simulated
data [14]. Note that at this time, GARD will not produce a JSON
file as output but instead several text files containing inference
information, as well as a final partitioned alignment for downstream
use if recombination was detected.

3.7 GARD


What Biological Question Is the Method Designed to
Answer?:

Have sequences in the given alignment undergone recombination,
and if so what are the recombination breakpoints and
segment-specific phylogenies?


Recommended Applications: 


GARD is geared towards mapping the breakpoints and detecting 
segments of the alignment which can be adequately described by a 
single tree topology. Therefore, alignments, particularly alignments 
of viral sequences, should be screened for the presence of recombi- 
nation before performing any selection inference. The NEXUS 
output from GARD can be directly used as input for most down- 
stream selection detection analyses. 
Statistical Test Procedure:

GARD employs a genetic algorithm to find a solution to a complex
optimization problem by mimicking processes of biological evolution
(mutation, recombination, and selection) in a population of
competing solutions. In this application of genetic algorithms, we
are evolving a population of “chromosomes” that specify different
numbers and locations of recombination breakpoints in the alignment
with the objective of detecting topological incongruence, i.e.,
support for different phylogenies by separate regions of the alignment.
The “fitness” of each chromosome is determined by using
maximum likelihood methods to evaluate a separate phylogeny
for each non-recombinant fragment defined by the breakpoints
(e.g., to the left and to the right of a breakpoint in Fig. 4), and
computing a goodness of fit (AICc) for each such model. The genetic
algorithm searches for the number and placement of breakpoints
yielding the best AICc and also reports confidence values for
inferred breakpoint locations based on the contribution of each
considered model weighted by how well the model fits the data. For
computational expedience, the current implementation of GARD
infers topologies for each segment using neighbor joining [37]
based on the TN93 pairwise distance estimator [41] and then
fits a user-specified nucleotide evolutionary model using maximum
likelihood to obtain AICc scores.


Fig. 4 Phylogenetic incongruence caused by the presence of a recombinant
sequence in an alignment. Sequence R is a product of homologous recombination
between sequences A and B. Phylogenies reconstructed from sequences A,
B, R and an outgroup sequence (O) will differ based on which part of the
alignment is being considered. To the left of the breakpoint, R clusters with A,
whereas to the right of the breakpoint R clusters with B
Example Analysis 1: We will demonstrate the use of GARD, as well
as its benefits for downstream analysis, using a dataset consisting of 13
glycoprotein sequences from Cache Valley Fever virus (cvf.fna). We
will first use GARD to detect recombination in this dataset, and then
we will process both the GARD-informed data and the original
alignment (with no recombination assumed) with FEL to see how
the presence of recombination may confound selection inference.

Importantly, GARD specifically requires the use of HyPhy’s MPI-enabled
executable, HYPHYMPI. To run GARD from the command
line, you will need an operating system with MPI headers and
libraries installed so that this executable can be compiled. Here, we
will describe how to use GARD from the command line, but we
emphasize that GARD is fully implemented and available on
www.datamonkey.org and takes the same input options described here.

To run GARD, open a terminal session and start HYPHYMPI
in the appropriate MPI environment (e.g., mpirun in Open MPI)
from the command line to launch the HyPhy analysis menu. Enter
12 (Recombination) and then 1 to reach the GARD analysis menu,
and supply values for the following prompts:


1. Nucleotide file to screen: Provide the full path to the dataset 
of interest: /path/to/data/cvf.fna. 


2. Please enter a 6-character model designation (e.g., 010010
defines HKY85). This option controls which nucleotide substitution
model is to be used for analysis, using PAUP notational
shorthand. The six-character shorthand allows the user
to specify the entire spectrum from F81 (000000) to GTR
(012345), which we recommend as default option. Provide
the value 012345 for this prompt.


3. Rate variation options. This option determines how site-to-site
rate variation should be modeled. The option None will
discount site-to-site rate variation, allowing the analysis to run
several times faster than other options but also creating the risk
of mistaking rate heterogeneity for recombination. As such, we
can only recommend this option for extremely small alignments
(e.g., 3-5 sequences). The option General Discrete
(the default) models rate variation using an N bin general
discrete distribution, and option Beta-Gamma models rate
variation using an adaptively discretized distribution, a more
flexible version of the standard Gamma+4 model. Enter option
2 to select the General Discrete model.


4. How many distribution bins [2-32]?. If rate variation was
selected in the previous step, this option allows the user to
decide how many different rate classes should be included in
the model. We recommend using 3 rate classes by default, as both
General Discrete and Beta-Gamma distributions are flexible
enough to reliably capture rate variability in the majority of alignments
with only a few rate classes. Therefore, enter the value 3.
5. Save results to. For this option, provide a full path to the
output file to which you would like GARD to write results.
The supplied file name will ultimately contain an HTML-formatted
summary of the analysis. HyPhy will generate several
other files with names obtained by appending suffixes (as in
<file name>_suffix) to the main result file. In particular,
the _finalout file stores the original alignment in NEXUS
format with inferred non-recombinant sections of the alignment
saved in the ASSUMPTIONS block and trees inferred for each
partition in the TREES block. This NEXUS file can be input into
many recombination-aware analyses in HyPhy and other programs
that can read NEXUS. The _ga_details file contains
two lines of information about each model examined by the
genetic algorithm: its AICc score and the location of breakpoints
in the model. Finally, the _ga_splits file stores information
about the location of breakpoints and trees inferred for each
alignment region under the best model found by the GA.
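The six-character model designation from prompt 2 can be unpacked programmatically. The sketch below assumes the usual PAUP ordering of the six substitution types (AC, AG, AT, CG, CT, GT), in which digits that match share one rate parameter; the function names are hypothetical and this is not part of HyPhy.

```python
from itertools import combinations

# Assumed PAUP ordering of the six substitution types: AC, AG, AT, CG, CT, GT.
PAIRS = ["".join(pair) for pair in combinations("ACGT", 2)]


def parse_model(designation: str) -> dict[str, int]:
    """Map each substitution type to its rate class."""
    if len(designation) != 6 or not designation.isdigit():
        raise ValueError("expected six digits, e.g. 010010")
    return dict(zip(PAIRS, (int(c) for c in designation)))


def free_rate_parameters(designation: str) -> int:
    """Distinct rate classes minus one, since one class acts as the reference."""
    return len(set(parse_model(designation).values())) - 1


for name, code in (("F81", "000000"), ("HKY85", "010010"), ("GTR", "012345")):
    print(name, parse_model(code), free_rate_parameters(code))
```

Under this ordering, 010010 groups the two transitions (AG and CT) apart from the four transversions, which is the HKY85 structure named in the prompt.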

GARD will now run to completion, printing status indicators
to screen while it runs:


Listing 8 Partial GARD output:

Fitting a baseline nucleotide model...
Done with single partition analysis. Log(L) = -5921.9511901113, c-AIC = 11914.85153276497
Starting the GA...

GENERATION 2 with 1 breakpoints (~0% converged)
Breakpoints   c-AIC      Delta c-AIC   [BP 1]
0             11914.85
1             11804.56   110.291       1393
GA has considered 92/328 (92 over all runs) unique models
Total run time 0 hrs 0 mins 2 seconds
Throughput 46.00 models/second
Allocated time remaining 999 hrs 59 mins 58 seconds (approx. 165599908 more models.)

GENERATION 52 with 4 breakpoints (~100% converged)
Breakpoints   c-AIC      Delta c-AIC   [BP 1]   [BP 2]   [BP 3]   [BP 4]
0             11914.85
1             11804.56   110.291       1445
2             11783.92   20.638        617      1490
3             11778.94   4.978         587      962      1475
4             11778.94   0.000         587      962      1475
GA has considered 268/473490550 (1356 over all runs) unique models
Total run time 0 hrs 4 mins 2 seconds
Throughput 5.60 models/second
Allocated time remaining 999 hrs 55 mins 58 seconds (approx. 20170544.82644628 more models.)

Performing the final optimization...
Interpreting Results:

GARD found evidence of recombination in this dataset with
three breakpoints, yielding a 135.9 point AICc improvement
over the model without recombination. Among all models with
three breakpoints in the Cache Valley Virus glycoprotein alignment,
the best model places them at nucleotides 587, 962, and
1475. Importantly, if GARD had reported that the best model
had 0 breakpoints, we could conclude that no evidence of recombination
had been found. Note that because genetic algorithms
are stochastic, there is no guarantee that replicate runs will
converge to exactly the same quantitative results. When there is
a strong signal of recombination breakpoints in the data, however,
the qualitative results (number and general location of
breakpoints) should be fairly robust.
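The AICc values in Listing 8 follow the standard small-sample correction to AIC. In the sketch below, the function name is hypothetical, and the parameter count k = 35 and sample size n = 2691 are not stated anywhere in the text; they are assumptions chosen because they reproduce the baseline c-AIC printed in the listing.

```python
def aicc(log_likelihood: float, k: int, n: int) -> float:
    """Small-sample corrected AIC: -2 log L + 2k + 2k(k + 1) / (n - k - 1)."""
    return -2.0 * log_likelihood + 2.0 * k + 2.0 * k * (k + 1) / (n - k - 1)


# Baseline (no breakpoint) log-likelihood from Listing 8.
# k and n are assumed values, not reported in the text.
baseline = aicc(-5921.9511901113, k=35, n=2691)
print(round(baseline, 2))

# Improvement of the best model over the baseline, from the c-AIC column.
print(round(11914.85 - 11778.94, 1))
```

The second value is the 135.9 point improvement discussed above: the difference between the c-AIC of the model with no breakpoints and that of the best model.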


Example Analysis 2: The NEXUS file that GARD produced is a
partitioned dataset, wherein different groups of sites are described by
different trees. Most HyPhy selection analyses discussed here (see footnote 6),
including MEME, FUBAR, FEL, SLAC, and BUSTED, are able to analyze
partitioned data. To demonstrate the importance of screening for
recombination, we will now compare results for a FEL analysis performed
on the original alignment of 13 Cache Valley Virus glycoproteins,
as well as on the GARD-inferred partitioned alignment. All
steps here were carried out as described earlier in this chapter.


Interpreting Results: 

FEL inference on the GARD-processed partitioned Cache Valley
Virus data does not detect sites under selection at P < 0.1. By
contrast, FEL inference on the unpartitioned Cache Valley
Virus data (i.e., not pre-screened for recombination) detects
three positively selected sites at P < 0.1 (212, 516, and 558 at
P = 0.08, P = 0.03, and P = 0.09, respectively). From these
results, we can clearly tell that not screening for recombination
has the potential for adverse consequences, including an increased
false positive rate as seen here. As such, we strongly encourage
users to screen alignments for recombination if such processes are
suspected before proceeding to selection detection.


6 Note that neither aBSREL nor RELAX accepts partitioned data because they require a consistent phylogeny to
define branch sets.

3.8 Accounting for Synonymous Rate Variation

A critical genomic process that one must consider when detecting
selection is the phenomenon of synonymous rate variation, wherein
the rate of synonymous codon evolution (represented by dS in the
context of codon models and representing mutation rate) varies
across species, genes, and even intragenic positions. In particular,
intragenic synonymous rate variation has been identified across
domains of life [11, 20, 30] and can arise from a variety of evolutionary
processes, including selection on mRNA secondary structure
[2], gene expression [4], GC-biased gene conversion [10], and
other neutral mutation processes. For example, even the genomic
context of a given nucleotide can influence its mutation rate;
indeed, experimental work has shown that GC-neighboring sites
can feature up to a 75-fold increase in mutation rate [20, 38]. In
addition, the synonymous rate at certain sites may be elevated due
to the mutational vulnerability of the non-template DNA strand
during transcription [20]. These processes must be accounted for
in order to ensure an appropriate baseline dS is used when testing
for selection.

We demonstrate the importance of considering synonymous 
rate variation for selection inference using a dataset of 10 mamma- 
lian CD2 genes, which code for a specific T-cell surface adhesion 
molecule [21]. We use FEL to detect selection in this dataset under 
two specifications: with synonymous rate variation (“yes” in 
prompt 4 in the FEL analysis menu), and without synonymous 
rate variation (“no” in prompt 4 in the FEL analysis menu). 


Interpreting Results: 

At P < 0.1, analysis of CD2 with synonymous rate variation 
revealed a total of 14 sites under positive selection. By contrast, 
CD2 analysis with FEL without dS variation only detected four 
sites under positive selection (Fig. 5). Similarly, analysis with dS 
variation revealed 27 sites under purifying selection, but analy- 
sis without dS variation revealed only 15 sites under purifying 
selection. Most importantly, all sites detected when dS was fixed 
to 1 were a subset of the sites identified by the model with dS 
variation (Fig. 5). Together, these results demonstrate that 
ignoring dS variation can induce both an increased false nega- 
tive rate regarding positive selection detection and an overall 
decrease in power to detect any selective regime. We acknowledge 
that it is possible that the opposite conclusion might be true, 
namely, that additional sites identified by FEL with dS varia- 
tion might instead be false positives. However, in our experience, 
this is much less frequently the case [12]. 


Fig. 5 Sites identified as positively (red) and negatively (blue) selected in CD2 at P < 0.1 by FEL run with
(above the line) and without dS variation (below the line). Sites with arrows represent those identified as 
selected by FEL with dS variation that were not identified by FEL when dS variation was ignored 


4 Tips


Here we provide some helpful notes on HyPhy usage.


• An actively maintained board for usage questions and filing bug
  reports is available at https://github.com/veg/hyphy/issues.


• Each HyPhy analysis described here will export a JSON file. This
  file can either be uploaded to HyPhy-Vision for visual examination
  or be parsed with standard packages in a scripting language, for
  example, the json package in Python or the jsonlite package in R.
  All fields used in these output files are defined at
  http://hyphy.org.


• Mac OS(X) users may need to install a new set of compilers (i.e.,
  gcc-6) that are compatible with OpenMP in order to have full
  functionality from the HYPHYMP executable, as is described on
  the HyPhy website.


5 Exercises


1. Earlier, we performed a BUSTED analysis without designating
a specific subset of test lineages. For this exercise, we will
analyze the HIV-1 transmission dataset with BUSTED in two
different ways: testing all branches, and testing only
recipient-derived HIV-1 sequences. The input data for this
exercise, with an appropriately labeled phylogeny, is available in
exercises/hiv1_transmission_exercise1.fna. Select either All
branches or the branches labeled test as the test lineages.


• Is there evidence (compare model fits using the small sample
  AIC) that test branches have a different selective regime
  than the rest of the tree?


• The entire dataset should provide evidence for episodic
  diversification, but the recipient-only analysis should return
  a negative result. What does this mean biologically, i.e.,
  where does the selection signal come from?


2. Investigate the effect of recombination on site-specific inference
of episodic selection using MEME. Run MEME on
exercises/cvf.fna (single partition data, i.e., assuming no
recombination), and then on the same dataset screened for
recombination using GARD (exercises/cvf_gard.nexus),
testing for selection on all branches, with P = 0.1. Compare
the lists of sites detected to be under selection by the two
analyses.


• Which analysis generated more positive results?


• Do you think these results are true or false positives? How
  does this compare to the FEL analysis we described in the
  text?


• Compare site-wise estimates of substitution rates (e.g., α)
  between the two analyses. Is there a discernible bias
  introduced by not accounting for recombination?


3. When analyzing intraspecies or intrahost data, dN/dS estimates
may be inflated because not all observed sequence variation is due
to substitutions; some of it consists of mutations that have not
yet been filtered by selection [17, 23, 31, 35]. In other words,
dN/dS may be elevated by intraspecies/intrahost polymorphism that
should not necessarily be attributed to positive selection. One
simple approach to mitigating this undesirable effect is to
restrict site-specific analyses to Internal branches only.
Internal branches are less likely to contain spurious polymorphic
variants because they encompass at least one process on which
selection can act (i.e., transmission and/or multiple rounds of
replication). Apply MEME and FEL to an intrahost sample of HIV-1
sequences, found in exercises/JS1774.nex, from an infected
individual analyzed in Lorenzo-Redondo et al. [19], first choosing
to test All branches, and next choosing Internal branches.


4. Compare the lists of selected sites between All/Internal ana- 
lyses. How different are they? 


5. Use RELAX to formally test whether or not selective regimes 
(dN/dS distributions) are different between terminal and
internal branches in exercises/JS1774.nex. 
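
The site-list comparisons asked for in these exercises can be scripted from the JSON files that each analysis exports. The following Python sketch is an illustration only and is not part of HyPhy. It assumes that per-site estimates are stored under an `MLE` field containing a `headers` list and a `content` dictionary keyed by partition index, and that one column is named `p-value`; these names should be checked against the field definitions at hyphy.org for the installed version. Numbering sites consecutively across partitions is a further assumption, the function names are hypothetical, and the sketch does not distinguish positive from purifying selection.

```python
import json


def selected_sites(result, threshold=0.1):
    """Return the set of 1-based site indices whose p-value is below threshold.

    `result` is a parsed HyPhy JSON document (a dict). Assumed layout:
      result["MLE"]["headers"] -> list of [name, description] pairs
      result["MLE"]["content"] -> {"0": [row, ...], "1": [row, ...], ...}
    """
    headers = [h[0] for h in result["MLE"]["headers"]]
    p_col = headers.index("p-value")  # assumed column name
    sites = set()
    site = 0
    # Walk partitions in numeric order, numbering sites consecutively.
    for key in sorted(result["MLE"]["content"], key=int):
        for row in result["MLE"]["content"][key]:
            site += 1
            if row[p_col] < threshold:
                sites.add(site)
    return sites


def load_result(path):
    """Parse one exported JSON file into a dict."""
    with open(path) as handle:
        return json.load(handle)


def compare(sites_a, sites_b):
    """Summarize the overlap between two sets of site indices."""
    return {
        "shared": sorted(sites_a & sites_b),
        "only_a": sorted(sites_a - sites_b),
        "only_b": sorted(sites_b - sites_a),
    }


# Usage (hypothetical file names):
# all_branches = selected_sites(load_result("all_branches.FEL.json"))
# internal = selected_sites(load_result("internal_branches.FEL.json"))
# print(compare(all_branches, internal))
```

Looking up the p-value column by header name rather than by position keeps the sketch usable across methods whose tables have different numbers of columns, provided the header name matches.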


Evolution of Protein Domain Architectures


Sofia K. Forslund, Mateusz Kaduk, and Erik L. L. Sonnhammer 


Abstract 


This chapter reviews current research on how protein domain architectures evolve. We begin by summariz- 
ing work on the phylogenetic distribution of proteins, as this will directly impact which domain architec- 
tures can be formed in different species. Studies relating domain family size to occurrence have shown that 
they generally follow power law distributions, both within genomes and larger evolutionary groups. These 
findings were subsequently extended to multi-domain architectures. Genome evolution models that have 
been suggested to explain the shape of these distributions are reviewed, as well as evidence for selective 
pressure to expand certain domain families more than others. Each domain has an intrinsic combinatorial 
propensity, and the effects of this have been studied using measures of domain versatility or promiscuity. 
Next, we study the principles of protein domain architecture evolution and how these have been inferred 
from distributions of extant domain arrangements. Following this, we review inferences of ancestral domain 
architecture and the conclusions concerning domain architecture evolution mechanisms that can be drawn 
from these. Finally, we examine whether all known cases of a given domain architecture can be assumed to 
have a single common origin (monophyly) or have evolved convergently (polyphyly). We end with a
discussion of some available tools for computational analysis or exploitation of protein domain architectures
and their evolution. 


Key words Protein domain, Protein domain architecture, Superfamily, Monophyly, Polyphyly, Con- 
vergent evolution, Domain evolution, Kingdoms of life, Domain co-occurrence network, Node degree 
distribution, Power law, Parsimony 


1 Introduction 


1.1 Overview


By studying the domain architectures of proteins, we can
understand their evolution as a modular phenomenon, with high-level
events enabling significant changes to take place in a time span
much shorter than required by point mutations only. This research
field has become possible only now in the -omics era of science, as
both identifying many domain families in the first place and
acquiring enough data to chart their evolutionary distribution
require access to many completely sequenced genomes. Likewise, the
conclusions drawn generally consider properties averaged for entire
species or organism groups or entire classes of proteins, rather than
properties of single genes.

We will begin by introducing the basic concepts of domains and
domain architectures, as well as the biological mechanisms by
which these architectures can change. The remainder of the chapter
is an attempt at answering, from the recent literature, the question
of which forces shape domain architecture evolution and in what
direction. The underlying issue concerns whether it is fundamentally
a random process or whether it is primarily a consequence of
selective constraints. We end by outlining some available software
tools and resources for analysis of domain architectures and their
evolution.


1.2 Protein Domains


Protein domains are high-level parts of proteins that either occur
alone or together with partner domains on the same protein chain.
Most domains correspond to tertiary structure elements and are
able to fold independently. All domains exhibit evolutionary
conservation, and many either perform specific functions or contribute
in a specific way to the function of their proteins. The word domain
strictly refers to a distinct region of a specific protein, an instance of
a domain family. However, domain and domain family are often
used interchangeably in the literature.


1.3 Domain Databases


By identifying recurring elements in experimentally determined
protein 3D structures, the various domain families in structural 
domain databases such as SCOP [1] and CATH [2] were gathered. 
New 3D structures allow assignment to these classes from semiau- 
tomated inspection. The SUPERFAMILY [3] database assigns 
SCOP domains to all protein sequences by matching them to 
hidden Markov models (HMMs) that were derived from SCOP 
superfamilies, i.e., proteins whose evolutionary relationship is evi- 
denced structurally. The Gene3D [4] database is similarly con- 
structed but based on domain families from CATH. 

This approach resembles the methodology used in pure 
sequence-based domain databases such as Pfam [5]. In these data- 
bases, conserved regions are identified from sequence analysis and 
background knowledge, to make multiple sequence alignments. 
From these, HMMs are built that are used to search new sequences 
for the presence of the domain represented by each HMM. All such 
instances are stored in the database. The HMM framework ensures 
stability across releases and high quality of alignments and domain 
family memberships. The stability allows annotation to be stored 
along with the HMMs and alignments. The InterPro database [6] is 
a meta-database of domains combining the assignments from sev- 
eral different source databases, including Pfam. The Conserved 
Domain Database (CDD) is a similar meta-database that also con- 
tains additional domains curated by the NCBI [7]. SMART [8] is a 
manually curated resource focusing primarily on signaling and
extracellular domains. ProDom [9] is a comprehensive domain
database automatically generated from sequences in UniProt
[10]. Likewise, ADDA [11] is automatically generated by clustering
subsequences of proteins from the major sequence databases,
though it has not been updated for some time. Genome3D [12] is a
recent consensus database which brings together several domain
prediction tools as well as the SCOP and CATH databases for
describing representative domain arrangements in a series of
trusted, well-annotated genomes.

Since the domain definitions from different databases only
partially overlap, results from analyses often cannot be directly
compared. In practice, however, choice of database appears to
have little effect on the main trends reported by the studies
described here.


1.4 Domain Architectures


The terms “domain architecture” or “domain arrangement” generally
refer to the domains in a protein and their order, reported in
N- to C-terminal direction along the amino acid chain. Another
recurring term is domain combinations. This refers to pairs of
domains co-occurring in proteins, either anywhere in the protein
(the “bag-of-domains” model) or specifically pairs of domains
being adjacent on an amino acid chain, in a specific N- to
C-terminal order [13]. The latter concept is expanded to triplets
of domains, which are subsequences of three consecutive domains,
with the N- and C-termini used as “dummy” domains. A domain X
occurring on its own in a protein thus produces the triplet N-X-C
[14].


1.5 Mechanisms for Domain Architecture Change


Most mutations are point mutations: substitutions, insertions, or
deletions of single nucleotides. While conceivably enough of these 
might create a new domain from an old one or noncoding sequence 
or remove a domain from a protein, in practice we are interested in 
mechanisms whereby the domain architecture of a protein changes 
instantly or nearly so (but see below for an overview of recent work 
on the origin of new domains). Figure 1 shows some examples of 
ways in which domain architectures may mutate. In general, adding 
or removing domains requires genetic recombination events. These 
can occur either through errors made by systems for repairing DNA 
damage such as homologous [16, 17] or nonhomologous (illegiti- 
mate) [18, 19] recombination or through the action of mobile 
genetic elements such as DNA transposons [20] or retrotranspo- 
sons [21, 22]. Recombination can cause loss or duplication of parts 
of genes, entire genes or much longer chromosomal regions. 

In organisms that have introns, exon shuffling [23, 24] refers to 
the integration of an exon from one gene into another, for instance, 
through chromosomal crossover, gene conversion, or mobile 
genetic elements. Exons could also be moved around by being
brought along by mobile genetic elements such as retrotransposons
[24, 25].


Fig. 1 Examples of mutations that can change domain architectures. Adapted from Buljan et al. [25]. (a) Gene
fusion by a mobile element. LINE refers to a Long Interspersed Nuclear repeat Element, a retrotransposon. The
reverse transcriptase encoded within the LINE causes its mRNA to be reverse-transcribed into DNA and
integrated into the genome, making the domain-encoding blue exon from the donor gene integrate along with
it in the acceptor gene. (b) Gene fusion by loss of a stop signal or deletion of much of the intergenic region.
Genes 1 and 2 are joined together into a single, longer gene. (c) Domain insertion through recombination. The
blue domain from the donor gene is inserted within the acceptor gene by either homologous or illegitimate
recombination. (d) Right: Gene fission by introduction of transcription stop (the letter Q) and start (the letter A).
Left: Domain loss by introduction of a stop codon (exclamation mark) with subsequent degeneration of the
now untranslated domain

Two adjacent genes can be fused into one if the first one loses 
its transcription stop signals. Point mutations can cause a gene to 
lose a terminal domain by introducing a new stop codon, after 
which the “lost” domain slowly degrades through point mutations 
as it is no longer under selective pressure [26]. Alternatively, a 
multi-domain gene might be split into two genes if both a start 
and a stop signal are introduced between the domains. Novel 
domains could arise, for instance, through exonization, whereby 
an intronic or intergenic region becomes an exon, after which 
subsequent mutations would fine-tune its folding and functional
properties [25, 27]. 

Recent literature (see, e.g., [28]) has discussed the possibility of 
de novo domain creation through a variety of mutational mechan- 
isms, with some support for this occurring more often than previ- 
ously thought [29, 30]. The majority of such new domains arise as 
novel genes from noncoding sequence but may subsequently 
recombine to join with older domains. Furthermore, young 
domains in vertebrates tend more often to occur at the 
N-terminal of a protein and tend to experience higher relative 
rates of non-synonymous substitution than older domains, which 
may reflect the nature of the mechanisms through which novel 
domains arise. Moore, Bornberg-Bauer et al. explore the relative 
prevalence of domain loss, duplication, and de novo origination in 
arthropods [31] and plants [32], suggesting such novel domains 
most frequently are associated with environmental adaptations. 
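
Before turning to family sizes, the architecture notions defined in Sect. 1.4 (bag-of-domains pairs, ordered adjacent pairs, and triplets padded with N- and C-terminal dummy domains) can be made concrete with a short Python sketch. An architecture is represented here as a list of domain family names in N- to C-terminal order. The function names and the use of the strings "N" and "C" for the dummy termini are choices made for this illustration, so a real domain family with one of those names would need a different placeholder.

```python
def adjacent_pairs(architecture):
    """Ordered pairs of neighboring domains, in N- to C-terminal order."""
    return list(zip(architecture, architecture[1:]))


def bag_of_domains_pairs(architecture):
    """Unordered pairs of distinct families co-occurring anywhere in the protein."""
    families = sorted(set(architecture))
    return [(a, b) for i, a in enumerate(families) for b in families[i + 1:]]


def triplets(architecture):
    """Windows of three consecutive domains, with "N" and "C" as dummy termini."""
    padded = ["N"] + list(architecture) + ["C"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


# A single-domain protein yields exactly one triplet, N-X-C:
# triplets(["X"]) -> [("N", "X", "C")]
```

Padding with dummy termini means that every protein with k domains yields exactly k triplets, so single-domain proteins are counted on the same footing as multi-domain ones.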


2 Distribution of the Sizes of Domain Families 


Domain architectures are fundamentally the realizations of how 
domains combine to form multi-domain proteins with complex 
functions. Understanding how these combinations come to be 
requires first that we understand how common the constituent 
domains of those architectures are and whether there are selective 
pressures determining their abundances. Because of this, the body 
of work concerning the sizes and species distributions of domain 
families becomes important to us. 

Comprehensive studies of the distributions and evolution of 
protein domains and domain architectures are possible as genome 
sequencing technologies have made many entire proteomes available
for bioinformatic analysis. Initial work [33-35] focused on the
number of copies that a protein family, either single domain or 
multi-domain, has in a species. Most conclusions from these early 
studies appear to hold true for domains, for supra-domains (see 
below) and for domain architectures [36-38]. In particular, these 
all exhibit a dominance of the population by a selected few [35], i.e., a 
small number of domain families are present in a majority of the 
proteins in a genome, whereas most domain families are found only 
in a small number of proteins. 

Looking at the frequency N of families of size X (defined as the 
number of members in the genome), in the earliest studies, this 
frequency was modeled as the power law 


$N = cX^{\alpha}$


where $\alpha$ is an exponent parameter. The power law is a special case of
the generalized Pareto distribution (GPD) [39]:


$N = c(i + X)^{\alpha}$


Power law distributions arise in a vast variety of contexts, including
human income distributions, connectivity of internet routers, word
usage in languages, and many other situations ([34, 35, 40, 41]; see
also [42] for a conflicting view). Luscombe et al. [35] described a
number of other genomic properties that also follow power law 
distributions, such as the occurrence of DNA “words,” pseudo- 
genes, and levels of gene expression. These distributions fit much 
better than the alternative they usually are contrasted against, an 
exponential decay distribution. The most important difference 
between exponential and power law distributions in this context 
concerns the fact that the latter has a “fat tail,” that is, while most 
domain families occur only a few times in each proteome, most 
domains in the proteome still belong to one of a small number of 
families. 
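
One simple way to estimate the exponent of a power law N = cX^α from a table of family sizes is ordinary least squares on log-transformed values, since log N = log c + α log X. The Python sketch below illustrates this under stated assumptions: the text does not say which fitting procedure the cited studies used, least squares on log-log data is a crude estimator, and the input is synthetic, generated from an exact power law with a negative exponent (the sign convention of the fitted values in the figure captions). All function names are hypothetical.

```python
import math
from collections import Counter


def size_frequencies(family_sizes):
    """Map each family size X to the number N of families of that size."""
    return dict(Counter(family_sizes))


def fit_power_law(freqs):
    """Least-squares fit of log N = log c + alpha * log X; returns (c, alpha)."""
    xs = [math.log(x) for x in freqs]
    ys = [math.log(n) for n in freqs.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    c = math.exp(mean_y - alpha * mean_x)
    return c, alpha


# Synthetic check: frequencies taken exactly from N = 1000 * X**-1.5.
synthetic = {x: 1000 * x ** -1.5 for x in range(1, 51)}
c_hat, alpha_hat = fit_power_law(synthetic)
```

On real proteome data the rare, very large families have noisy frequencies, so a fit of this kind is best read as a description of the overall slope, not as a precise parameter estimate.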

Later work ([39, 43], see also [44]) demonstrated that 
proteome-wide domain occurrence data fit the general GPD better 
than the power law but that it also asymptotically fits a power law for
X ≫ i. The deviation from strict power law behavior depends on
proteome size in a kingdom-dependent manner [43]. Regardless, it 
is mostly appropriate to treat the domain family size distribution as 
approximately (and asymptotically) power law-like, and later stud- 
ies typically assume this. 
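
The asymptotic statement can be checked in one line, assuming the GPD has the form N = c(i + X)^α with a constant offset i and the same exponent α as the power law:

```latex
N = c\,(i + X)^{\alpha}
  = c\,X^{\alpha}\left(1 + \frac{i}{X}\right)^{\alpha}
  \;\longrightarrow\; c\,X^{\alpha}
  \quad\text{as } \frac{i}{X} \to 0 .
```

Under this assumption the correction factor (1 + i/X)^α differs appreciably from 1 only for small families, which is where the GPD and the pure power law give different fits.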

The power law, but not the GPD, is scale-free in the sense of 
fulfilling the condition 


$f(ax) = g(a)\,f(x)$


where f(x) and g(x) are some functions of a variable x and where a is
a scaling parameter, that is, studying the data at a different scale will
not change the shape of the function. This property has been
extensively studied in the literature and is connected to other attributes,
notably when it occurs in network degree distributions (i.e.,
frequency distributions of edges per node). Here it has been associated
critical hubs (nodes with many edges to other nodes), the similarity 
between parts and the whole (as in a fractal), and the growth 
process called preferential attachment, under which nodes are 
more likely to gain new links the more links they already have. 
However, the same power law distribution may be generated 
from many different network topologies with different patterns of 
connectivity. In particular, they may differ in the extent that hubs 
are connected to each other [42]. It is possible to extend the 
analysis by taking into account the distribution of degree pairs 
along network edges, but this is normally not done. 
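
The scale-free condition can be checked numerically. For a power law f(x) = cx^α, the ratio f(ax)/f(x) equals a^α for every x, so g(a) = a^α. For a GPD-like function with a nonzero offset, the ratio changes with x. The Python sketch below uses arbitrary illustrative parameter values (c = 2, α = −1.5, offset i = 5) that are not taken from any dataset, and the GPD form c(i + x)^α is assumed.

```python
def power_law(x, c=2.0, alpha=-1.5):
    """Pure power law c * x**alpha."""
    return c * x ** alpha


def gpd(x, c=2.0, i=5.0, alpha=-1.5):
    """GPD-like form c * (i + x)**alpha with offset i (assumed form)."""
    return c * (i + x) ** alpha


def rescaling_ratios(f, a, xs):
    """Return f(a*x) / f(x) for each x; the values agree when f is scale-free."""
    return [f(a * x) / f(x) for x in xs]


xs = [1.0, 10.0, 100.0]
pl_ratios = rescaling_ratios(power_law, 3.0, xs)  # all equal 3**-1.5
gpd_ratios = rescaling_ratios(gpd, 3.0, xs)       # vary with x
```

The GPD ratios approach the power law value as x grows, in line with the asymptotic power law behavior described earlier.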

What kind of evolutionary mechanisms give rise to this kind of 
distribution of gene or domain family sizes within genomes? In one 
model by Huynen and van Nimwegen [33], every gene within a 
gene family will be more or less likely to duplicate, depending on
the utility of the function of that gene family within the particular 
lineage of organisms studied, and they showed that such a model 
matches the observed power laws. While they claimed that any 
model that explains the data must take into account family-specific 
probabilities of duplication fixation, Yanai and coworkers [45] 
proposed a simpler model using uniform duplication probability 
for all genes in the genome and also reported a good fit with data. 
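
To give a feel for how uniform per-gene duplication alone can produce a heavy-tailed family size distribution, the following Python sketch simulates a toy process. It is not the published model of either group: at each step a new single-member family appears with probability `p_new` (a hypothetical innovation parameter), and otherwise one gene chosen uniformly from the whole genome is duplicated, so a family's chance of growing is proportional to its current size. There is no gene loss and no selection.

```python
import random
from collections import Counter


def simulate_families(steps, p_new=0.1, seed=0):
    """Grow a genome by uniform per-gene duplication plus occasional innovation.

    Returns a list of family sizes. Toy model: no gene loss, no selection.
    """
    rng = random.Random(seed)
    genes = [0]        # genes[k] is the family id of gene k
    n_families = 1
    for _ in range(steps):
        if rng.random() < p_new:
            genes.append(n_families)         # a brand-new family of size 1
            n_families += 1
        else:
            genes.append(rng.choice(genes))  # duplicate a uniformly chosen gene
    return list(Counter(genes).values())


sizes = simulate_families(20000)
histogram = Counter(sizes)  # histogram[X] = number of families of size X
```

Processes of this rich-get-richer type are known to yield many small families alongside a few very large ones; whether the resulting exponent matches real proteomes depends on the parameters and on the processes left out here.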

Later, more complex birth-death [43] and birth-death-and- 
innovation (BDIM) [29, 34, 39, 46] models were introduced to 
explain the observed distributions, and from investigating which 
model parameter ranges allow this fit, the authors were able to draw 
several far-ranging conclusions. First, the asymptotic power law 
behavior requires that the rates of domain gain and loss are asymp- 
totically equal. Karev et al. [39] interpreted this as support for a 
punctuated equilibrium-type model of genome evolution, where 
domain family size distributions remain relatively stable for long 
periods of time but may go through stages of rapid evolution, 
representing a shift between different BDIM evolutionary models 
and significant changes in genome complexity. Like Huynen and 
van Nimwegen [33], they concluded that the likelihood of fixed
domain duplications or losses in a genome directly depends on
family size. The family will, however, only grow as long as new copies
can find new functional niches and contribute to a net benefit for 
survival, i.e., as long as selection favors it. 

Aside from Huynen and van Nimwegen’s, none of the models 
discussed depend very strongly on family-specific selection to 
explain the abundances of individual gene families, nor do they 
exclude such selection. Some domains may be highly useful to 
their host organism’s lifestyle, such as cell-cell connectivity domains 
to an organism beginning to develop multicellularity. Expansion of 
these domain families might therefore become more likely in some 
lineages than in others. To what extent these factors actually affect 
the size of domain families remains to be fully explored. Karev et al. 
[39] suggested that the rates of domain-level change events them- 
selves—domain duplication and loss rates, as well as the rate of 
influx of novel domains from other species or de novo creation— 
must be evolutionarily adapted, as only some such parameters allow 
the observed distributions to be stable. Van Nimwegen [47]
investigated how the number of genes increases in specific functional
categories as total genome size increases. He found that the rela- 
tionship matches a power law, with different coefficients for each 
functional class remaining valid over many bacterial lineages. Ranea 
et al. found similar results. Also, Ranea et al. [48] showed that, for 
domain superfamilies inferred to be present in the last universal 
common ancestor (LUCA), domains associated with metabolism 
have significantly higher abundance than those associated with 
translation, further supporting a connection between the function
of a domain family and how likely it is to expand. 

Extending the analysis to multi-domain architectures, Apic 
et al. [37] showed that the frequency distribution of multi-domain 
family sizes follows a power law curve similar to that reported for 
individual domain families. It therefore seems likely that the basic 
underlying mechanisms should be similar in both cases, i.e., that 
duplication of genes, and thus their domain architectures, is the 
most important type of event affecting the evolution of domain 
architectures. 

Have the trends described above stood the test of time as more
genomes have been sequenced and more domain families have been
identified? We considered the 1943 UniProt proteomes covered by
version 30.0 of Pfam, plotted the frequency Y of domain families
that have precisely X members as a function of X, and fit a power
law curve to this. Figure 2a shows the resulting plots for three
representative species, one complex eukaryote (Homo sapiens),
one simple eukaryote (Saccharomyces cerevisiae), and one prokaryote
(Escherichia coli). Figure 2b shows the corresponding plots for
all domains in all complete eukaryotic, bacterial, and archaeal
proteomes. The power law curve fits decently well, with slopes
becoming less steep for the more complex organisms, whose distributions
have relatively more large families. The power law-like behavior
suggests that complex organisms with large proteomes were
formed by heavily duplicating domains from relatively few families.
Figures 3a, b show equivalent plots, not for single domains but for
entire multi-domain architectures. The curve shapes and the
relationship between both species and organism groups are similar,
indicating that the evolution of these distributions has been


Fig. 2 (a) Distribution of domain family sizes in three selected species. Power law distributions were fitted to
these curves such that for frequency f of families of size X, $f = cX^{a}$. For S. cerevisiae, a = −1.9, for E. coli,
a = −1.7, and for H. sapiens, a = −1.5. (b) Distribution of domain family sizes across the three kingdoms.
Power law distributions were fitted to these curves such that for frequency f of families of size X, $f = cX^{a}$. For
bacteria, a = −0.9, for archaea, a = −1.1, for eukaryotes, a = −0.8, and for viruses, a = −1.9


Fig. 3 (a) Distribution of multi-domain (architecture) family sizes in three selected species. Power law
distributions were fitted to these curves such that for frequency f of families of size X, $f = cX^{a}$. For
S. cerevisiae, a = −2.0, for E. coli, a = −1.8, and for H. sapiens, a = −1.5. (b) Distribution of multi-
domain (architecture) family sizes across the three kingdoms. Power law distributions were fitted to these
curves such that for frequency f of families of size X, $f = cX^{a}$. For bacteria, a = −1.0, for archaea, a = −1.1,
for eukaryotes, a = −1.1, and for viruses, a = −2.0


3 Kingdom and Age Distribution of Domain Families and Architectures


How old are specific domain families or domain architectures? With 
knowledge of which organism groups they are found in, it is possi- 
ble to draw conclusions about their age and whether lineage- 
specific selective pressures have determined their kingdom-specific 
abundances. Domain families and their combinations have arisen 
throughout evolutionary history, presumably by new combinations 
of pre-existing elements that may have diverged beyond recogni- 
tion or by processes such as exonization. We can estimate the age of 
a domain family by finding the largest clade of organisms within 
which it is found, excluding organisms with only xenologs, i.e., 
horizontally transferred genes [14]. The age of this lineage’s root is 
the likely age of the family. The same holds true for domain com- 
binations and entire domain architectures. This methodology 
allows us to determine how changing conditions at different points 
in evolutionary history, or in different lineages, have affected the 
evolution of domain architectures. 
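
The age estimate described above amounts to finding the most recent common ancestor of all species that carry the family by vertical descent. A minimal Python sketch follows, assuming each species is annotated with its lineage as a root-to-leaf list of taxon names and that species with only xenologs have already been removed from the presence set. The taxonomy, family names, and function names are hypothetical, and attaching a date to the resulting node is left out.

```python
def common_ancestor(lineages):
    """Deepest taxon shared by all root-to-leaf lineages (lists of taxon names)."""
    shared = []
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        shared.append(level[0])
    return shared[-1] if shared else None


def family_origin(family, species_lineage, vertical_presence):
    """Taxon at whose root a family is inferred to have arisen.

    species_lineage:   {species: [root, ..., species]}
    vertical_presence: {family: set of species carrying the family,
                        excluding species with only xenologs}
    """
    carriers = sorted(vertical_presence[family])
    return common_ancestor([species_lineage[s] for s in carriers])


# Hypothetical toy taxonomy and presence data.
TAXONOMY = {
    "human": ["cellular organisms", "Eukaryota", "Metazoa", "human"],
    "fly": ["cellular organisms", "Eukaryota", "Metazoa", "fly"],
    "yeast": ["cellular organisms", "Eukaryota", "Fungi", "yeast"],
    "E. coli": ["cellular organisms", "Bacteria", "E. coli"],
}
PRESENCE = {
    "famA": {"human", "fly"},
    "famB": {"human", "yeast"},
    "famC": {"fly", "E. coli"},
}
```

With these toy data, `family_origin("famA", TAXONOMY, PRESENCE)` resolves to the metazoan root and `famC` to the root of all cellular organisms. A single undetected horizontal transfer would push the inferred origin deeper, which is why xenologs are excluded first.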

Apic et al. [36] analyzed the distribution of SCOP domains 
across 40 genomes from archaea, bacteria, and eukaryotes. They 
found that a majority of domain families are common to all three 
kingdoms of life and thus likely to be ancient. Kuznetsov et al. [43] 
performed a similar analysis using InterPro domains and found that 
only about one fourth of all such domains were present in all three 
kingdoms, but a majority was present in more than one of them. 
Lateral gene transfer or annotation errors can cause a domain family 
to be found in one or a few species in a kingdom without actually 
belonging to that kingdom. To counteract this, one can require 
that a family must be present in at least a reasonable fraction of the 
species within a kingdom for it to be considered anciently present 
there. For instance, using Gene3D assignments of CATH domains 
to 114 complete genomes, mainly bacterial, Ranea et al. [48] 
isolated protein superfamily domains that were present in at least 
90% of all the genomes and at least 70% of the archaeal and 
eukaryotic genomes, respectively. Under these stringent cutoffs 
for considering a domain to be present in a kingdom, 140 domains, 
15% of the CATH families found in at least one prokaryote 
genome, were inferred to be ancient. Chothia and Gough [49] 
performed a similar study on 663 SCOP superfamily domains 
evaluated at many different thresholds and found that while 
516 (78%) superfamilies were common to all three kingdoms at a 
threshold of 10% of species in each kingdom, only 156 (24%) 
superfamilies were common to all three kingdoms at a threshold 
of 90%. They also showed that for prokaryotes, a majority of 
domain instances (i.e., not domain families but actual domain
copies) belong to common superfamilies at all thresholds below 
90%. 
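
The threshold criterion used in these studies can be sketched as a presence-fraction test per kingdom. The Python example below applies a single threshold to every kingdom, which is a simplification: Ranea et al. [48] combined a 90% cutoff over all genomes with 70% cutoffs for archaea and eukaryotes. The species names, kingdom labels, and function name are hypothetical.

```python
def kingdoms_present(family_species, species_kingdom, threshold):
    """Kingdoms in which a family occurs in at least `threshold` of the species."""
    totals, hits = {}, {}
    for species, kingdom in species_kingdom.items():
        totals[kingdom] = totals.get(kingdom, 0) + 1
        if species in family_species:
            hits[kingdom] = hits.get(kingdom, 0) + 1
    return {k for k in totals if hits.get(k, 0) / totals[k] >= threshold}


# Hypothetical toy data: four bacteria, two archaea, two eukaryotes.
SPECIES_KINGDOM = {
    "b1": "Bacteria", "b2": "Bacteria", "b3": "Bacteria", "b4": "Bacteria",
    "a1": "Archaea", "a2": "Archaea",
    "e1": "Eukaryota", "e2": "Eukaryota",
}
FAMILY = {"b1", "b2", "b3", "b4", "a1", "e1", "e2"}

lenient = kingdoms_present(FAMILY, SPECIES_KINGDOM, 0.1)  # all three kingdoms
strict = kingdoms_present(FAMILY, SPECIES_KINGDOM, 0.9)   # archaea drop out
```

A family counts as common to all three kingdoms only if the returned set contains all of them, so raising the threshold can only shrink the set of universal families. The percentages quoted from Chothia and Gough [49] follow from the counts: 516/663 is about 78% and 156/663 is about 24%.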

Extending to domain combinations, Apic et al. [36] reported
that a majority of SCOP domain pairs are unique to each kingdom 
but also that more kingdom-specific domain combinations than 
expected were composed only of domain families shared between 
all three kingdoms. This would imply a scenario where the inde- 
pendent evolution of the three kingdoms mainly involved creating 
novel combinations of domains that existed already in their com- 
mon ancestor. 

Several studies have reported interesting findings on domain 
architecture evolution in lineages closer to ourselves: in metazoa 
and vertebrates. Ekman et al. [50] claimed that new metazoa- 
specific domains and multi-domain architectures have arisen 
roughly once every 0.1–1 million years in this lineage. According 
to their results, most metazoa-specific multi-domain architectures 
are a combination of ancient and metazoa-specific domains. The 
latter category is, however, mostly found as novel single-domain 
proteins. Many of the novel metazoan multi-domain architectures 
involve domains that are versatile (see below) and exon-bordering 
(allowing for their insertion through exon shuffling). The novel 
domain combinations in metazoa are enriched for proteins asso- 
ciated with functions required for multicellularity—regulation, sig- 
naling, and functions involved in newer biological systems such as 
immune response or development of the nervous system, as previ- 
ously noted by Patthy [23]. They also showed support for exon 
shuffling as an important mechanism in the evolution of metazoan 
domain architectures. Itoh et al. [51] added that animal evolution 
differs significantly from other eukaryotic groups in that lineage- 
specific domains played a greater part in creating new domain 
combinations. Nasir et al. [52] analyzed the age and taxonomic 
distribution of domains drawing on species phylogenies recon- 
structed from domain repertoires, concluding among other things 
that most widespread domains are relatively old and suggesting 
high numbers of both domain gain and loss in the evolution of 
the three organismal superkingdoms. Bacterial and archaeal genes 
have tended to gain or lose domains encoding aspects of metabolic 
capacity, whereas those of eukaryotes—including multicellular 
ones—have gained domains enabling more elaborate extracellular 
processes such as immunity and regulatory capacities. 

In the most recent datasets, what is the distribution of domains 
and domain combinations across the three kingdoms of life? Look- 
ing at the set of UniProt proteomes represented in version 30.0 of 
Pfam, the distribution of domains across the three kingdoms is as 
displayed in the Venn diagram of Fig. 4a. Figures 4b and 4c show the 
equivalent distributions of immediate neighbors and triplets of 
domains, respectively, and Fig. 4d the distribution of multi-domain 
architectures across kingdoms. The numbers are somewhat biased 
toward bacteria as 56% of the UniProt proteomes are from this 
kingdom. However, with this high coverage of all kingdoms 
Fig. 4 (a) Kingdom distribution of unique domains. Values are given as percentages of the total, 10,330 
domains. (b) Kingdom distribution of unique domain pairs. Values are given as percentages of the total, 31,287 
domain pairs. (c) Kingdom distribution of unique domain triplets. Values are given as percentages of the total, 
33,662 domain triplets. (d) Kingdom distribution of unique multi-domain architectures. Values are given as 
percentages of the total, 23,238 multi-domain architectures 


(506 eukaryotic, 94 archaeal, and 1090 bacterial proteomes, as well 
as 253 viral entities), the results should be robust in this respect. 
Compared to most previous reports, we see a striking difference in 
that a much smaller portion of domains are shared between all 
kingdoms. There are some potential artifacts which could affect 
this analysis. If lateral gene transfer is very widespread, we may 
overestimate the number of families present in all three kingdoms. 
Moreover, there are cases where separate Pfam families are actually 
distant homologs of each other, which could lead to underestima- 
tion of the number of ancient families. To counteract this, we make 
use of Pfam clans, considering domains in the same clan to be 
equivalent. While not all distant homologies have yet been 
registered in the clan system, performing the analysis on the clan 
level reduces the risk of such underestimation. 

Our finding that 10% of all Pfam-A domains are present in all 
three main kingdoms is strikingly lower than in the earlier works 
and is even lower than reported by Ranea et al. [48], who used very 
stringent cutoffs. However, a direct comparison of statistics for 
Pfam domains/clans and CATH superfamilies is difficult. The 
decrease in ancient families that we observe may be a consequence 
of the massive increase in sequenced genomes and/or that the 
recent growth of Pfam has added relatively more kingdom-specific 
domains. We further found that only 1.5% of all domains or domain 
combinations are unique to archaea, suggesting that known repre- 
sentatives of this lineage have undergone very little independent 
evolution and/or that most archaeal gene families have been hori- 
zontally transferred to other kingdoms. The trend when going 
from domain via domain combinations to whole architectures is 
clear—the more complex patterns are less shared between the king- 
doms. In other words, each kingdom has used a common core of 
domains to construct its own unique combinations of multi- 
domain architectures. 
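
The trend from single domains to longer combinations can be made concrete with a small sketch. It assumes each protein is given as a kingdom label plus an ordered list of domain identifiers, and it counts a feature as shared only if it occurs in every kingdom in the dataset; `features` and `shared_fraction` are illustrative names, and the toy proteins are invented.

```python
def features(architecture, n):
    """All runs of n consecutive domains in an architecture (n=1: single domains)."""
    return {tuple(architecture[i:i + n]) for i in range(len(architecture) - n + 1)}

def shared_fraction(proteins, n):
    """Fraction of distinct n-domain runs that occur in every kingdom of the dataset."""
    seen = {}
    for kingdom, arch in proteins:
        for feat in features(arch, n):
            seen.setdefault(feat, set()).add(kingdom)
    all_kingdoms = {kingdom for kingdom, _ in proteins}
    shared = sum(1 for kingdoms in seen.values() if kingdoms == all_kingdoms)
    return shared / len(seen)

# Toy dataset: (kingdom, architecture in N- to C-terminal order)
proteins = [
    ("bacteria", ["A", "B"]),
    ("archaea", ["A", "B"]),
    ("eukaryota", ["A", "B", "C"]),
    ("bacteria", ["B", "D"]),
]

single = shared_fraction(proteins, 1)    # domains
pairs = shared_fraction(proteins, 2)     # immediate neighbor pairs
triplets = shared_fraction(proteins, 3)  # triplets
```

Here half of the domains, one third of the pairs, and none of the triplets are shared by all kingdoms, which is the same qualitative pattern as in Fig. 4, although the real analysis additionally merges families by clan.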


4 Domain Co-occurrence Networks 


A multi-domain architecture connects individual domains with 
each other. There are several ways to derive these connections and 
quantify the level of co-occurrence. The simplest method is to 
consider all domains on the same amino acid chain to be connected, 
but we can also limit the set of co-occurrences we consider to, e.g., 
immediate neighbor pairs or triplets. Regardless of which method is 
used, the result is a domain co-occurrence network, where nodes 
represent domains and where edges represent the existence of 
proteins in which members of these families co-occur. Figure 5 
shows an example of such a network and the set of domain archi- 
tectures which defines it. This type of explicit network representa- 
tion is explored in several studies, notably by Itoh et al. [51], 
Przytycka et al. [53], and Kummerfeld and Teichmann [13]. It is 
advantageous as it allows the introduction of powerful analysis tools 
developed within the engineering sciences for use with artificial 
network structures such as the World Wide Web. The patterns of 
co-occurrences that we observe should be a direct consequence of 
the constraints and conditions under which domain architectures 
evolve, and because of this, the study of these patterns becomes 
relevant for understanding such factors. 
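
As a minimal sketch of the two simplest definitions above, the following code derives a network either from immediate neighbor pairs or from all domains on the same chain. It assumes architectures are ordered lists of domain identifiers and, as a simplifying assumption of this sketch, ignores links between a domain and a tandem copy of itself; the function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def neighbor_network(architectures):
    """Undirected network linking domains that are adjacent in at least one protein."""
    edges = defaultdict(set)
    for arch in architectures:
        for left, right in zip(arch, arch[1:]):
            if left != right:  # assumption: skip tandem repeats of the same domain
                edges[left].add(right)
                edges[right].add(left)
    return edges

def chain_network(architectures):
    """Undirected network linking all distinct domains found on the same chain."""
    edges = defaultdict(set)
    for arch in architectures:
        for a, b in combinations(set(arch), 2):
            edges[a].add(b)
            edges[b].add(a)
    return edges

# Toy architectures in N- to C-terminal order
toy_archs = [["A", "B", "C"], ["C", "D"]]
nbr = neighbor_network(toy_archs)
chain = chain_network(toy_archs)
```

In the toy data, domain A is linked only to B in the neighbor network but to both B and C in the same-chain network, so the choice of definition directly changes node degrees.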

The frequency distribution of node degrees in the domain 
co-occurrence network has been fitted to a power law [36] and a 
more general GPD as well [40]. The closer this approximation 
holds, the more the network will have the scale-free property. 
Fig. 5 Example of protein domain co-occurrence network, adapted from Kummerfeld 
and Teichmann [13]. (a) Sample set of domain architectures. The lines 
represent proteins and the boxes their domains in N- to C-terminal order. (b) 
Resulting domain co-occurrence (neighbor) network. Nodes correspond to 
domains and are linked by an edge if at least one protein exists where the 
two domains are found adjacent to each other along the amino acid chain 


This property can be thought of as a hierarchy in the network, 
where the more centrally connected nodes link to more peripheral 
nodes with the same relative frequency at each level. In the context 
of domains, this means that a small number of domains co-occur 
with a high number of other domains, whereas most domains only 
have a few neighbors—usually some of the highly connected hubs. 
The most highly connected domains are referred to as promiscuous 
[54], mobile, or versatile [14, 55, 56]. Many such hub domains are 
involved in intracellular or extracellular signaling, protein-protein 
interactions and catalysis, and transcription regulation. In general, 
these are domains that encode a generic function, e.g., phosphory- 
lation, which is reused in many contexts by additional domains that 
Table 1 
The 20 most densely connected hubs with regard to immediate domain neighbors, according to Pfam 30.0 

Identifier  Name                                                             Number of different immediate neighbors 
CL0023      P-loop containing nucleoside triphosphate hydrolase superfamily  415 
CL0063      FAD/NAD(P)-binding Rossmann fold superfamily                     390 
CL0123      Helix-turn-helix clan                                            358 
CL0016      Protein kinase superfamily                                       192 
CL0159      Ig-like fold superfamily (E-set)                                 148 
CL0020      Tetratricopeptide repeat superfamily                             146 
CL0028      Alpha/beta-hydrolase fold                                        140 
CL0172      Thioredoxin-like                                                 136 
CL0036      Common phosphate-binding site TIM barrel superfamily             136 
CL0219      Ribonuclease H-like superfamily                                  127 
CL0058      Tim barrel glycosyl hydrolase superfamily                        120 
CL0257      N-acetyltransferase-like                                         115 
CL0167      Zinc beta-ribbon                                                 114 
CL0072      Ubiquitin superfamily                                            112 
CL0125      Peptidase clan CA                                                106 
CL0186      Beta propeller clan                                              105 
CL0021      OB fold                                                          101 
CL0192      Family A G protein-coupled receptor-like superfamily             101 
CL0015      Major facilitator superfamily                                    97 
CL0220      EF-hand-like superfamily                                         95 


confer substrate specificity or localization. Table 1 shows the 
domains (or clans) with the highest numbers of immediate neigh- 
bors in Pfam 30.0. 

One way of evolving a domain co-occurrence network that 
follows a power law is by “preferential attachment” [53, 57]. This 
means that new edges (corresponding to proteins where two 
domains co-occur) are added with a probability that is higher the 
more edges these nodes (domains) already have, resulting in a 
power law distribution. 
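
A toy simulation shows how preferential attachment produces a few hubs and many poorly connected nodes. This is a deliberately simplified sketch, assuming a growth process in which each new node brings exactly one edge (the simplest Barabási-Albert-style variant); it is not the model fitted in the cited studies, and the function name and parameters are illustrative.

```python
import random

def preferential_attachment(n_nodes, seed=0):
    """Grow a network one node at a time; each new node links to one existing
    node chosen with probability proportional to that node's current degree."""
    rng = random.Random(seed)
    degree = [1, 1]     # start from a single edge between nodes 0 and 1
    endpoints = [0, 1]  # each node is listed once per edge it takes part in
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)  # uniform over edge ends = degree-proportional
        degree[target] += 1
        degree.append(1)
        endpoints.extend([target, new])
    return degree

degrees = preferential_attachment(2000)
median_degree = sorted(degrees)[len(degrees) // 2]
max_degree = max(degrees)
```

Sampling uniformly from the list of edge endpoints is what makes the choice proportional to degree: a node with k edges appears k times in that list.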

Apic et al. [37] considered a null model for random domain 
combination, in which a proteome contains domain combinations 
with a probability based on the relative abundances of the domains 
only. They showed that this model does not hold and that far fewer 
domain combinations than expected under it are actually seen. If 
most domain duplication events are gene duplication events that do 
not change domain architecture—or at the very least do not disrupt 
domain pairs—then this finding is not unexpected, nor does it 
require or exclude any particular selective pressure to keep these 
domains together in proteins. There is growing support for the idea 
that separate instances of a given domain architecture in general 
descend from a single ancestor with that architecture [58], with 
polyphyletic evolution of domain architectures occurring only in a 
small fraction of cases [53, 59, 60]. 
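
One way to get a feel for an abundance-only null model is to shuffle domain instances among proteins and count how many distinct neighbor pairs result. The sketch below is an assumption-laden stand-in, not the procedure of Apic et al.: it keeps protein lengths and domain abundances fixed, permutes domain instances uniformly, and uses invented toy data and illustrative function names.

```python
import random

def distinct_pairs(architectures):
    """Set of unordered adjacent domain pairs observed in any architecture."""
    return {frozenset(pair) for arch in architectures for pair in zip(arch, arch[1:])}

def shuffled_pair_counts(architectures, n_trials=100, seed=0):
    """Number of distinct adjacent pairs after randomly reassigning domain
    instances to protein positions, repeated n_trials times."""
    rng = random.Random(seed)
    pool = [d for arch in architectures for d in arch]
    counts = []
    for _ in range(n_trials):
        rng.shuffle(pool)
        slots = iter(pool)
        shuffled = [[next(slots) for _ in arch] for arch in architectures]
        counts.append(len(distinct_pairs(shuffled)))
    return counts

# Toy proteome in which duplication has copied two architectures many times
observed_archs = [["A", "B"]] * 50 + [["C", "D"]] * 50
observed = len(distinct_pairs(observed_archs))
null_counts = shuffled_pair_counts(observed_archs)
mean_null = sum(null_counts) / len(null_counts)
```

Because the toy proteome consists of many copies of just two architectures, the observed number of distinct pairs falls well below the shuffled expectation, which illustrates why whole-gene duplication alone can explain a deficit of combinations.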

Itoh et al. [51] performed reconstruction of ancestral domain 
architectures using maximum parsimony, as described in the next 
section. This allowed them to study the properties of the ancestral 
domain co-occurrence network and thus explore how network 
connectivity has altered over evolutionary time. Among other 
things, they found increased connectivity in animals, particularly 
of animal-specific domains, and suggest that this phenomenon 
explains the high connectivity for eukaryotes reported by Wuchty 
[40]. For non-animal eukaryotes, they reported a correlation 
between connectivity and age, such that older domains had rela- 
tively higher connectivity, with domains preceding the divergence 
of eukaryotes and prokaryotes being the most highly connected, 
followed by early eukaryotic domains. In other words, early eukary- 
otic evolution saw the emergence of some key hub proteins, while 
the most prominent eukaryotic hubs emerged in the animal lineage. 
Parikesit et al. [61] studied the functional annotation of 
co-occurring domains in eukaryotes, concluding that while these 
may have different associated functional descriptors, these descrip- 
tors usually tend to fall within the same overall category within the 
gene ontology. Co-occurring domains thus tend to contribute to 
the same overall process type rather than have very widely divergent 
functional annotations. Hsu et al. [62] constructed a network 
linking domain architectures (i.e., each node is a multi-domain 
architecture, as opposed to in a regular domain co-occurrence 
network) where parsimonious reconstruction suggests evolution 
of one from the other, identifying “highly evolvable” architectures 
as hubs in this network. Proteins with such architectures were 
reported to be more widespread, less often essential, more often 
duplicated, and more often associated with gene functions involved 
in specific adaptation of organisms. 

What is the degree distribution of current domain 
co-occurrence networks? We again used the domain architectures 
from all complete proteomes in version 30.0 of Pfam and consid- 
ered the network of immediate neighbor relationships, i.e., nodes 
(domains) have an edge between them if there is a protein where 
they are adjacent. Each domain was assigned a degree as its number 
Fig. 6 (a) Distribution of domain co-occurrence network node degrees in three selected species. Power law 
distributions were fitted to these curves such that for frequency f of families of size X, $f = cX^{a}$. For 
S. cerevisiae, a = -2.2, for E. coli, a = -2.0, and for H. sapiens, a = -1.9. (b) Distribution of domain 
co-occurrence network node degrees across the three kingdoms. This corresponds to a network where two 
domains are connected if any species within the kingdom has a protein where these domains are immediately 
adjacent. Power law distributions were fitted to these curves such that for frequency f of families of size X, 
$f = cX^{a}$. For bacteria, a = -1.6, for archaea, a = -1.7, for eukaryotes, a = -7.5, and for viruses, a = -2.0 


of links to other domains. We then counted the frequency with 
which each degree occurs in the co-occurrence network. Figure 6a 
shows this relationship for the set of domain architectures found in 
the same species as in Fig. 2a, and Fig. 6b shows the equivalent plots 
for the three kingdoms as found among the complete proteomes in 
Pfam. Regressions to a power law have been added to the plots. The 
presence of a power law-like behavior of this type implies that few 
domains have very many immediate neighbors, while most domains 
have few immediate neighbors. Note that the observed degrees in 
our dataset were strongly reduced by removing all sequences with a 
stretch longer than 50 amino acids lacking domain annotation. 
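
The counting and regression steps just described can be sketched as follows. The example assumes the network is stored as a dictionary mapping each domain to its set of neighbors and fits $f = cX^{a}$ by ordinary least squares on log-transformed values; this is one simple choice among several possible estimators and is not necessarily the procedure used for Fig. 6. Function names and data are illustrative.

```python
import math
from collections import Counter

def degree_frequencies(edges):
    """Map each degree to the number of nodes that have it."""
    return Counter(len(neighbors) for neighbors in edges.values())

def fit_power_law(freqs):
    """Least-squares fit of log f = log c + a * log X; returns (c, a).
    Assumes all degrees and frequencies are positive."""
    xs = [math.log(x) for x in freqs]
    ys = [math.log(f) for f in freqs.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    c = math.exp(my - a * mx)
    return c, a

# Tiny network: A is linked to B and C
toy_edges = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}
toy_freqs = degree_frequencies(toy_edges)

# Synthetic frequencies that follow f = 1000 * X^-2 exactly
exact = {1: 1000.0, 2: 250.0, 4: 62.5, 8: 15.625}
c, a = fit_power_law(exact)
```

On the synthetic data the fit recovers an exponent of -2 and a prefactor of 1000 up to rounding error; real degree distributions are noisier, especially in the sparsely populated high-degree tail.
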


5 Supra-domains and Conserved Domain Order 


As we have seen, whole multi-domain architectures or shorter 
stretches of adjacent domains are often repeated in many proteins. 
These only cover a small fraction of all possible domain combina- 
tions. Are the observed combinations somehow special? We would 
expect selective pressure to retain some domain combinations but 
not others, since only some domains have functions that would 
synergize together in one protein. Often, co-occurring domains 
require each other structurally or functionally, for instance, in 
transcription factors where the DNA-binding domain provides 
substrate specificity, whereas the trans-activating domain recruits 
other components of the transcriptional machinery [63]. Vogel 
et al. [38] identified series of domains co-occurring as a fixed unit 
with conserved N- to C-terminal order but flanked by different 
domain architectures and termed them supra-domains. By investi- 
gating their statistical overrepresentation relative to the frequency 
of the individual domains in the set of nonredundant domain 
architectures (where “nonredundant” is crucial, as otherwise, e.g., 
whole-gene duplication would bias the results), they identified a 
number of such supra-domains. Many ancient domain combina- 
tions (shared by all three kingdoms) appear to be such selectively 
preserved supra-domains. 

How conserved is the order of domains in multi-domain archi- 
tectures? In a recent study, Kummerfeld and Teichmann [13] built 
a domain co-occurrence network with directed edges, allowing it to 
represent the order in which two domains are found in proteins. As 
in other studies, the distribution of node degrees fits a power law 
well. Most domain pairs were only found in one orientation. This 
does not seem required for functional reasons, as flexible linker 
regions should allow the necessary interface to form also in the 
reversed case [58], but may rather be an indication that most 
domain combinations are monophyletic. Weiner and Bornberg- 
Bauer [64] analyzed the evolutionary mechanisms underlying a 
number of reversed domain order cases and concluded that independent 
fusion/fission is the most frequent scenario. Although 
domain reversals occur in only a few proteins, they actually happen 
more often than expected from randomizing a co-occurrence 
network [13]. That study also observed that the domain 
co-occurrence network is more clustered than expected by a ran- 
dom model and that these clusters are also functionally more 
coherent than would be expected by chance. 
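
Counting how many adjacent domain pairs occur in one orientation versus both is straightforward once edges are directed. The sketch below assumes architectures are lists in N- to C-terminal order and skips pairs formed by tandem copies of the same domain; the function name and toy data are illustrative only.

```python
def orientation_counts(architectures):
    """Return (pairs seen in one orientation only, pairs seen in both orientations)."""
    directed = {
        (left, right)
        for arch in architectures
        for left, right in zip(arch, arch[1:])
        if left != right  # assumption: ignore tandem repeats
    }
    unordered = {frozenset(pair) for pair in directed}
    both = {frozenset(pair) for pair in directed if (pair[1], pair[0]) in directed}
    return len(unordered) - len(both), len(both)

# Toy architectures: B and C occur in both orders, the other pairs in one order only
toy_archs = [["A", "B"], ["A", "B", "C"], ["C", "B"], ["D", "A"]]
one_way, two_way = orientation_counts(toy_archs)
```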




6 Domain Mobility, Promiscuity, or Versatility 


While some protein domains co-occur with a variety of other 
domains, some are always seen alone or in a single architecture in 
all proteomes where they are found. A natural explanation is that 
some domains are more likely to end up in a variety of architectural 
contexts than others due to some intrinsic property they possess. Is 
such domain versatility or promiscuity a persistent feature of a given 
domain, and does it correlate with certain functional or biological 
properties of the domain? 

Several ways of measuring domain versatility have been sug- 
gested. One measure, NCO [40], counts the number of other 
domains found in any architectures where the domain of interest 
is found. Another measure, NN [37], instead counts the number of 
distinct other domains that a domain is found adjacent to. Yet 
another measure, NTRP [65], counts the number of distinct tri- 
plets of consecutive domains where the domain of interest is found 
in the middle. All of these measures can be expected to be higher 
for common domains than for rare domains, i.e., variations in 
domain abundance (the number of proteins a domain is found in) 
can hide the intrinsic versatility of domains. Therefore, three differ- 
ent studies [14, 55, 66] formulated relative domain versatility 
indices that aim to measure versatility independently of abundance. 
It is worth noting that most studies have considered only immedi- 
ately adjacent domain neighbors in these analyses, a restriction 
based on the assumption that those are more likely to interact 
functionally than domains far apart on a common amino acid 
chain. More recent work [67] introduced a network versatility 
metric which can classify domains as being central or peripheral 
with regard to the large-scale structure of their bigram network 
(i.e., the network-linking domains found adjacent in proteins), 
observing that peripheral domains exhibit relatively higher 
primary sequence conservation suggestive of adaptation to more 
specific functions, whereas the core domains may be more 
multifunctional. 
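
The three raw measures can be computed directly from a list of architectures. The following sketch follows the verbal definitions given above; how the original publications treat edge cases such as tandem repeats is not specified here, so excluding the domain itself from its own neighbor set is an assumption of this example, as are the function name and toy data.

```python
def versatility_counts(architectures, domain):
    """Raw versatility measures for one domain.

    NCO  -- distinct other domains sharing any architecture with the domain
    NN   -- distinct other domains immediately adjacent to the domain
    NTRP -- distinct triplets of consecutive domains with the domain in the middle
    """
    nco, nn, ntrp = set(), set(), set()
    for arch in architectures:
        if domain not in arch:
            continue
        nco.update(d for d in arch if d != domain)
        for i, d in enumerate(arch):
            if d != domain:
                continue
            if i > 0 and arch[i - 1] != domain:
                nn.add(arch[i - 1])
            if i + 1 < len(arch) and arch[i + 1] != domain:
                nn.add(arch[i + 1])
            if 0 < i < len(arch) - 1:
                ntrp.add((arch[i - 1], d, arch[i + 1]))
    return {"NCO": len(nco), "NN": len(nn), "NTRP": len(ntrp)}

# Toy architectures in N- to C-terminal order
toy_archs = [["A", "X", "B"], ["C", "X"], ["X", "B", "D"], ["E", "F"]]
x_counts = versatility_counts(toy_archs, "X")
```

For domain X in the toy data, D contributes to NCO but not to NN because it shares a chain with X without being adjacent to it.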

The first relative versatility study was presented by Vogel et al. 
[66], who used as their domain dataset the SUPERFAMILY data- 
base applied to 14 eukaryotic, 14 bacterial, and 14 archaeal pro- 
teomes. They modeled the number of unique immediate neighbor 
domains as a power law function of domain abundance, performed 
a regression on this data, and used the resulting power law expo- 
nent as a relative versatility measure. Basu et al. [55] used Pfam and 
SMART [8] domains and measured relative domain versatility for 
28 eukaryotes as the immediate neighbor pair frequency normal- 
ized by domain frequency. They then defined promiscuous 
domains as a class according to a bimodality in the distribution of 
the raw numbers of unique domain immediate neighbor pairs. 




Weiner et al. [14] used Pfam domains for 10,746 species in all 
kingdoms and took as their relative versatility measure the logarith- 
mic regression coefficient for each domain family across genomes, 
meaning that it is not defined within single proteomes. 
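
To illustrate why normalization by abundance matters, the sketch below divides each domain's number of distinct immediate neighbors by the number of proteins containing it. This ratio is a deliberately crude, hypothetical stand-in and is not the index defined in any of the three studies; it only shows how two equally abundant domains can differ in relative versatility.

```python
from collections import defaultdict

def abundance_and_neighbors(architectures):
    """For each domain: (number of proteins containing it, number of distinct neighbors)."""
    abundance = defaultdict(int)
    neighbors = defaultdict(set)
    for arch in architectures:
        for d in set(arch):
            abundance[d] += 1
        for left, right in zip(arch, arch[1:]):
            if left != right:
                neighbors[left].add(right)
                neighbors[right].add(left)
    return {d: (abundance[d], len(neighbors[d])) for d in abundance}

def neighbors_per_protein(architectures):
    """Hypothetical relative versatility: distinct neighbors divided by abundance."""
    stats = abundance_and_neighbors(architectures)
    return {d: nn / n for d, (n, nn) in stats.items()}

# K pairs with a different partner in each protein; R always pairs with S
toy_archs = [["K", "A"], ["K", "B"], ["K", "C"], ["R", "S"], ["R", "S"], ["R", "S"]]
ratios = neighbors_per_protein(toy_archs)
```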

To what extent is high versatility an intrinsic property of a 
certain domain? Vogel et al. [66] only examined large groups of 
domains together and therefore did not address this question for 
single domains. Basu et al. [55] and Weiner et al. [14] instead 
analyzed each domain separately and concluded that there are 
strong variations in relative versatility at this level. Their results 
are very different in detail, however, reflected by the fact that only 
one domain family (PF00004, AAA ATPase family) is shared 
between the ten most versatile domains reported in the two studies. 
As they used fairly similar domain datasets, it would appear that the 
results strongly depend on the definition of relative versatility. 
Another potential reason for the different results is that Basu’s list 
was based on eukaryotes only, while Weiner’s analysis was heavily 
biased toward prokaryotes. Furthermore, the top ten list in Basu 
et al. [55] and their follow-up paper [56] only overlap by four 
domains, yet the main difference is that in the latter study all 
28 eukaryotes were considered, while the former study was limited 
to the subset of 20 animal, plant, and fungal species. The choice of 
species thus seems pivotal for the results when using this method. 
They also used different methods for calculating the average value 
of relative versatility across many species, which may influence the 
results. 

Does domain versatility vary between different functional clas- 
ses of domains? Vogel et al. [66] found no difference in relative 
versatility between broad functional or process categories or 
between SCOP structural classes. In contrast to this, Basu et al. 
[55] reported that high versatility was associated with certain func- 
tional categories in eukaryotes. However, no test for the statistical 
significance of these results was performed. Weiner et al. [14] also 
noted some general trends but found no significant enrichment of 
gene ontology terms in versatile domains. This does not necessarily 
mean that no such correlation exists, but more research is required 
to convincingly demonstrate its strength and its nature. More 
recently, Cromar et al. [68] analyzed domain architectures in 
eukaryotic extracellular matrix proteomes, noting that these struc- 
tures are organized around a set of versatile domains under the 
weighted bigram metric of Basu et al. [55]. 

Another important question is to what extent domain versatil- 
ity varies across evolutionary lineages. Vogel et al. [66] reported no 
large differences in average versatility for domains in different king- 
doms. The versatility measure of Basu et al. [55] can be applied 
within individual genomes, which means that according to this 
measure domains may be versatile in one organism group but not 
in another, as well as gain or lose versatility across evolutionary 
time. They found that more domains were highly versatile in animals 
than in other eukaryotes. Modeling versatility as a binary 
property defined for domains in extant species, they further used 
a maximum parsimony approach to study the persistence of versatility 
for each domain across evolutionary time and concluded that 
both gain and loss of versatility are common during evolution. 
Inferring ancestral domain architectures, Cohen-Gihon et al. [69] 
report an increase in versatility in many domains during eukaryotic 
evolution, in particular around the divergence of Bilateria. Weiner 
et al. [14] divided domains into age categories based on distribution 
across the tree of life and reported that the versatility index is 
not dependent on age, i.e., domains have equal chances of becom- 
ing versatile at different times in evolution. This is consistent with 
the observation by Basu et al. [55] that versatility is a fast-evolving 
and varying property. When measuring versatility as a regression 
within different organism groups, Weiner et al. [14] found slightly 
lower versatility in eukaryotes, which is in conflict with the findings 
of Basu et al. [55]. Again, this underscores the strong dependence 
of the results on the method and dataset. 

Further properties reported to correlate with domain versatility 
include sequence length, where Weiner et al. [14] found that 
longer domains are significantly more versatile within the frame- 
work of their study, while at the same time, shorter domains are 
more abundant and hence may have more domain neighbors in 
absolute numbers. Basu et al. [55] further reported that more 
versatile domains have more structural interactions than other 
domains. To determine which of these reported correlations 
genuinely reflect universal biological trends, further comprehensive 
studies are needed using more data and uniform procedures. This 
would hopefully allow the results from the studies described here to 
be validated and any conflicts between them to be resolved. 

Basu et al. [55] further analyzed the phylogenetic spread of all 
immediate domain neighbor pairs (“bigrams”) containing domains 
classified as promiscuous. The main observation this yielded was 
that although most such combinations occurred in only a few 
species, most promiscuous domains are part of at least one combi- 
nation that is found in a majority of species. They interpreted this as 
implying the existence of a reservoir of evolutionarily stable domain 
combinations from which lineage-specific recombination may draw 
promiscuous domains to form unique architectures. Later work by 
Hsu et al. [70] analyzed the domain co-occurrence networks cen- 
tered on each domain family, classifying such subnetworks as being 
either mostly starlike, taillike, or tetragon-like, with promiscuous 
domains forming cores of starlike architecture networks in this 
representation. 


7 Principles of Domain Architecture Evolution 


What mutation events can generate new domain architectures, and 
what is their relative predominance? The question can be 
approached by comparing protein domain architectures of extant 
proteins. This is based on the likely realistic assumption that most 
current domain architectures evolved from ancestral domain archi- 
tectures that can still be found unchanged in other proteins. 
Because of this, in pairs of most similar extant domain architectures, 
one can assume that one of them is ancestral. This agrees well with 
results indicating that most groups of proteins with identical 
domain architectures are monophyletic. By comparing the most 
similar proteins, several studies have attempted to chart the relative 
frequencies of different architecture-changing mutations. 

Bjorklund et al. [71] used this particular approach and came to 
several conclusions. First, changes to domain architecture are much 
more common at the N- and C-termini than internally in the 
architecture. This is consistent with several mechanisms for archi- 
tecture changes such as introduction of new start or stop codons or 
mergers with adjacent genes, and similar results have been found in 
several other studies [15, 25, 26]. Furthermore, insertions or dele- 
tions of domains (“indels”) are more common than substitutions of 
domains, and the events in question mostly concern just single 
domains, except in cases with repeats expanding with many 
domains in a row [72]. In a later study, the same group made use 
of phylogenetic information as well, allowing them to infer direc- 
tionality of domain indels [50]. They then found that domain 
insertions are significantly more common than domain deletions. 
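
The pairwise comparison underlying these observations can be sketched as a function that locates a single contiguous domain indel between two architectures. This is an illustrative simplification, assuming architectures are lists of domain identifiers and that only one contiguous run of domains differs; it does not reproduce the cited authors' pipeline, and ambiguous cases (for instance an extra copy of a repeated domain) are reported at the first matching position.

```python
def classify_indel(shorter, longer):
    """Locate the run of extra domains in `longer` relative to `shorter`.

    Returns 'N-terminal', 'C-terminal', 'internal', or None when the two
    architectures do not differ by exactly one contiguous indel.
    """
    extra = len(longer) - len(shorter)
    if extra <= 0:
        return None
    if longer[extra:] == shorter:
        return "N-terminal"
    if longer[:len(shorter)] == shorter:
        return "C-terminal"
    for i in range(1, len(shorter)):
        # prefix and suffix must both match around the inserted run
        if longer[:i] == shorter[:i] and longer[i + extra:] == shorter[i:]:
            return "internal"
    return None

n_term = classify_indel(["B", "C"], ["A", "B", "C"])
c_term = classify_indel(["A", "B"], ["A", "B", "C"])
internal = classify_indel(["A", "C"], ["A", "B", "C"])
substitution = classify_indel(["A", "C"], ["A", "B", "D"])
```

Without phylogenetic information the function cannot tell an insertion from a deletion; it only reports where the two architectures differ.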

Weiner et al. [26] performed a similar analysis on domain loss 
and found compatible results—most changes occur at the termini 
(see also discussion in [28]). Moreover, they demonstrated that 
terminal domain loss seldom involves losing only part of a domain, 
or rather, that such partial losses quickly progress into loss of the 
entire domain. However, it is important to ensure such observa- 
tions are not confounded by cases where errors in gene boundary 
recognition make domain detection less accurate [73]. 

There is some support [23, 74, 75] for exon shuffling having 
played an important part in domain evolution, and there are a 
number of domains that match intron borders well, for example, 
structural domains in extracellular matrix proteins. While it may not 
be a universal mechanism, exon shuffling is suggested to have been 
particularly important for vertebrate evolution [23]. 

Recognizing the potential role of gene duplications in domain 
architecture evolution, Grassi et al. [76] analyzed domain architec- 
ture shifts following either whole-genome duplication (WGD) or 
smaller-scale gene duplication events in yeast. Surviving WGD 
duplicates had retained ancestral architecture in ca 95% of cases, 
with approximately the same chance of architecture change in 
WGD as under local duplication. Genes retained over time from 
either type of duplication were enriched for a core of commonly 
occurring domains but with a subset of rarer domains additionally 
enriched in retained WGD duplicates compared to locally dupli- 
cated genes. The former category more often was associated with 
housekeeping-type gene functions, whereas the latter more often 
involved adaptive functions. Functional change was generally larger 
than architectural change following duplication. Zhang et al. [77] 
similarly studied domain architecture evolution in plants, noting 
that lineage-specific architecture expansions largely can be 
explained from differential retention of genes following successive 
whole-genome duplications. Another form of domain duplication 
particularly relevant in plants is amplification of the numbers of 
domain repeats in proteins, discussed, e.g., by Sharma and 
Pandey [78]. 


8 Inferring Ancestral Domain Architectures 


The above analyses, based on pairwise comparison of extant protein 
domain architectures, cannot tally ancestral evolutionary events 
nearer the root of the tree of life. With ancestral architectures, one 
can directly determine which domain architecture changes have 
taken place during evolution and precisely chart how mechanisms 
of domain architecture evolution operate, as well as gauge their 
relative frequency. A drawback is that since we can only infer 
ancestral domain architectures from extant proteins, the result will 
depend somewhat on our assumptions about evolutionary mechan- 
isms. On the upside, it should be possible to test how well different 
assumptions fit the observed modern-day protein domain architec- 
ture patterns. 

Attempts at such reconstructions have been made using parsi- 
mony. Given a gene tree and the domain architectures at the leaves, 
dynamic programming can be used in order to find the assignment 
of architectures to internal nodes that requires the smallest number 
of domain-level mutation events. This simple model can be elabo- 
rated by weighting loss and gain differently or by requiring that a 
domain or an architecture can only be gained at most once in a tree 
(Dollo parsimony) [79]. 
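
A minimal sketch of weighted parsimony by dynamic programming is given below. To keep it short, it reduces the problem to the presence or absence of a single domain rather than whole architectures, takes the tree as nested tuples of leaf names, and does not implement the Dollo constraint; all names and the toy tree are assumptions made for illustration.

```python
INF = float("inf")

def sankoff(tree, leaf_state, gain_cost=1.0, loss_cost=1.0):
    """Minimum weighted number of gains and losses of one domain on a rooted tree.

    tree       -- a leaf name (str) or a tuple of subtrees
    leaf_state -- dict: leaf name -> 1 if the domain is present, else 0
    Returns (cost if the root lacks the domain, cost if the root has it).
    """
    if isinstance(tree, str):
        state = leaf_state[tree]
        return (0.0 if state == 0 else INF, 0.0 if state == 1 else INF)
    cost = [0.0, 0.0]
    for child in tree:
        c0, c1 = sankoff(child, leaf_state, gain_cost, loss_cost)
        cost[0] += min(c0, c1 + gain_cost)  # parent absent: child absent, or gain on branch
        cost[1] += min(c1, c0 + loss_cost)  # parent present: child present, or loss on branch
    return tuple(cost)

# Toy gene tree: the domain is present in leaves a and b, absent in c and d
toy_tree = (("a", "b"), ("c", "d"))
toy_states = {"a": 1, "b": 1, "c": 0, "d": 0}

equal_weights = sankoff(toy_tree, toy_states)
costly_gain = sankoff(toy_tree, toy_states, gain_cost=3.0)
```

With equal weights, one gain or one loss explains the data equally well; making gains three times as costly as losses favors a root that carries the domain followed by a single loss, which shows how the weighting scheme shapes the inferred history.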

An early study of Snel et al. [80] considered 252 gene trees 
across 17 fully sequenced species and used parsimony to minimize 
the number of gene fission and fusion events occurring along the 
species tree. Their main conclusion, that gene fusions are more 
common than gene fissions, was subsequently supported by a larger 
study by Kummerfeld and Teichmann [81], where fusions were 
found to be about four times as common as fissions in a most 
parsimonious reconstruction. Fong et al. [82] followed a similar 
procedure on yet more data and concluded that fusion was 5.6
times as likely as fission. 

Buljan and Bateman [15] performed a similar maximum parsi- 
mony reconstruction of ancestral domain architectures. They too 
observed that domain architecture changes primarily take place at 
the protein termini, and the authors suggested that this might 
largely occur because terminal changes to the architecture are less 
likely to disturb overall protein structure. Moreover, they con- 
cluded from reconciliation of gene and species trees that domain 
architecture changes were more common following gene duplica- 
tions than following speciation but that these cases did not differ 
with respect to the relative likelihood of domain losses or gains. 

Recently, Buljan et al. [25] presented a new ancestral domain 
architecture reconstruction study which assumed that gain of a 
domain should take place only once in each gene tree, i.e., Dollo 
parsimony [79]. Their results also support gene fusion as a major 
mechanism for domain architecture change. The fusion is generally 
preceded by a duplication of either of the fused genes. Intronic 
recombination and insertion of exons are observed but relatively 
rarely. They also found support for de novo creation of disordered 
segments by exonization of previously noncoding regions. More 
recently still a method for domain architecture history reconstruc- 
tion using a network construct called a plexus was described 
[83]. Yang and Bourne [84] further described another 
parsimony-based reconstruction approach, as did Wu et al. [85], 
reporting that histories of signaling and development proteins are 
enriched for gene fusion/fission events. Stolzer et al. [86] present 
another method for domain architecture history inference, made 
available through the Notung software. 


9 Polyphyletic Domain Architecture Evolution 


There appears to be a “grammar” for how protein domains are 
allowed to be combined. If nature continuously explores all possi- 
ble domain combinations, one would expect that the allowed com- 
binations would be created multiple times throughout evolution. 
Such independent creation of the same domain architecture can be 
called convergent or polyphyletic evolution, whereas a single original
creation event for all extant examples of an architecture would
be called divergent or monophyletic evolution. This is relevant for 
several reasons, not least because it determines whether or not we 
can expect two proteins with identical domain architectures to have 
the same history along their entire length. 

A graph theoretical approach to answer this question was taken 
by Przytycka et al. [53], who analyzed the set of all proteins con- 
taining a given superfamily domain. The domain architectures of 
these proteins define a domain co-occurrence network, where 
edges connect two domains both found in a protein, regardless of
sequential arrangement. The proteins of such a set can also be 
placed in an evolutionary tree, and the evolution of all multi- 
domain architectures containing the reference domain can be 
expressed in terms of insertions and deletions of other domains 
along this tree to form the extant domain architectures. The ques- 
tion, then, is whether or not all leaf nodes sharing some domain 
arrangement (up to and including an entire architecture) stem from 
a single ancestral node possessing this combination of domains. For 
monophyly to be true for all architectures containing the reference 
domain, the same companion domain cannot have been inserted in 
more than one place along the tree describing the evolution of the 
reference domain. By application of graph theory and Dollo parsi- 
mony [79], they showed that monophyly is only possible if the 
domain co-occurrence network defined by all proteins containing 
the reference domain is chordal, i.e., it contains no chordless cycles
longer than three edges.
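
The chordality criterion can be checked mechanically. The sketch below is an illustration under stated assumptions, not the implementation of Przytycka et al.: it builds a co-occurrence network from a list of architectures (each a list of domain identifiers) and tests chordality by repeatedly removing simplicial vertices, i.e., vertices whose neighbors are all mutually connected. A graph is chordal exactly when this elimination can remove every vertex. The function names and the toy identifiers are hypothetical.

```python
from itertools import combinations


def cooccurrence_graph(architectures):
    """Undirected graph: nodes are domains, edges join domains that share
    at least one protein, regardless of their order in the sequence."""
    graph = {}
    for arch in architectures:
        domains = set(arch)
        for d in domains:
            graph.setdefault(d, set())
        for a, b in combinations(domains, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def is_chordal(graph):
    """True if every cycle longer than three edges has a chord."""
    g = {v: set(nbrs) for v, nbrs in graph.items()}  # working copy
    while g:
        # A simplicial vertex is one whose neighbors form a clique.
        simplicial = next(
            (v for v, nbrs in g.items()
             if all(b in g[a] for a, b in combinations(nbrs, 2))),
            None,
        )
        if simplicial is None:
            return False  # a chordless cycle of four or more nodes remains
        for n in g[simplicial]:
            g[n].discard(simplicial)
        del g[simplicial]
    return True


# Toy data: "R" plays the role of the reference domain in every protein.
chain = [["R", "A", "B"], ["R", "B", "C"]]
ring = [["R", "A", "B"], ["R", "B", "C"], ["R", "C", "D"], ["R", "D", "A"]]
chain_ok = is_chordal(cooccurrence_graph(chain))
ring_ok = is_chordal(cooccurrence_graph(ring))
```

In the `ring` example the companion domains A, B, C, and D form a four-node cycle without a chord, so at least one of them must have been inserted more than once relative to the reference domain phylogeny.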

Przytycka et al. [53] then evaluated this criterion for all super- 
family domains in a large-scale dataset. For domains where the 
co-occurrence network contained fewer than 20 nodes (domains), 
the chordal property and hence the possibility of complete mono- 
phyly of all domain combinations and domain architectures con- 
taining that domain held. By comparing actual domain 
co-occurrence networks with a preferential attachment null 
model, they showed that far more architectures are potentially 
monophyletic than would be expected under a pure preferential 
attachment process. This finding is analogous to the observation by 
Apic et al. [37] that most domain combinations are duplicated 
more frequently (or reshuffled less) than expected by chance. In 
other words, gene duplication is much more frequent than domain 
recombination [66]. However, for many domains that co-occurred 
with more than 20 other different domains, particularly for 
domains previously reported as promiscuous, the chordal property 
was violated, meaning that multiple independent insertions of the 
same domain, relative to the reference domain phylogeny, must be 
assumed. 

A more direct approach is to do complete ancestral domain 
architecture reconstruction of protein lineages and to search for 
concrete cases that agree with polyphyletic architecture evolution. 
There are two conceptually different methodologies for this type of 
analysis. Either one only considers architecture changes between 
nodes of a species tree, or one considers any node in a reconstructed
gene tree. The advantage of using a species tree is that one avoids 
the inherent uncertainty of gene trees, but on the other hand, only 
events that take place between examined species can be observed. 
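
Once ancestral architectures have been assigned to the nodes of a tree, by whatever reconstruction method, the search for polyphyletic cases reduces to counting how many times an architecture arises independently. The sketch below assumes such an assignment is already available. The tree encoding, the function name, and the example architectures are illustrative assumptions rather than part of either methodology.

```python
def independent_origins(tree, architecture_of, target, root):
    """Count the branches (plus the root) at which `target` first appears.

    tree            : dict mapping each internal node to a list of children
    architecture_of : dict mapping every node to a tuple of domain identifiers
    """
    origins = 1 if architecture_of[root] == target else 0
    stack = [root]
    while stack:
        node = stack.pop()
        for child in tree.get(node, []):
            # An origin is a branch whose child has the architecture
            # while its parent does not.
            if (architecture_of[child] == target
                    and architecture_of[node] != target):
                origins += 1
            stack.append(child)
    return origins


# Hypothetical tree in which architecture ("A", "B") arises on two branches.
tree = {"root": ["X", "Y"], "X": ["x1", "x2"], "Y": ["y1", "y2"]}
architecture_of = {
    "root": ("A",),
    "X": ("A", "B"), "x1": ("A", "B"), "x2": ("A", "B"),
    "Y": ("A",), "y1": ("A", "B"), "y2": ("A",),
}
n_origins = independent_origins(tree, architecture_of, ("A", "B"), "root")
is_polyphyletic = n_origins > 1
```

Applied to a species tree, only nodes corresponding to the examined species and their ancestors are available. Applied to a gene tree, every duplication node is a candidate as well, which is why the gene-tree view can detect more events.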

Gough [59] applied the former species-tree-based methodol- 
ogy to SUPERFAMILY domain architectures and concluded that 
polyphyletic evolution is rare, occurring in 0.4–4% of architectures.
The value depends on methodological details, with the lower
bound considered more reliable. 

The latter gene-tree-based methodology was applied by For- 
slund et al. [60] to the Pfam database. Ancestral domain architec- 
tures were reconstructed through maximum parsimony of single- 
domain phylogenies which were overlaid for multi-domain pro- 
teins. This strategy yielded a higher figure, ranging between 6% 
and 12% of architectures depending on dataset and whether or not 
incompletely annotated proteins were removed. The two different 
approaches thus give very different results. The detection of poly- 
phyletic evolution is in both frameworks dependent on the data 
that is used—its quality, coverage, filtering procedures, etc. The 
studies used different datasets, which makes them hard to compare.
However, given that their domain annotations are more or less 
comparable, the major difference ought to be the ability of the 
gene-tree method to detect polyphyly at any point during evolu- 
tion, even within a single species. It should be noted that domain 
annotation is by no means complete—only a little less than half of 
all residues are assigned to a domain [5]—and this is clearly a 
limiting factor for detecting architecture polyphyly. The numbers 
may thus be adjusted considerably upwards when domain annota- 
tion reaches higher coverage. A later study by Zmasek and Godzik 
[87] reports much higher rates (25-75%) still of polyphyletic evo- 
lution of eukaryotic multi-domain architectures, arguing that pre- 
vious datasets were too small to have the power to reveal this. 

Future work will be required to provide more reliable estimates 
of how common polyphyletic evolution of domain architectures 
is. Any estimate will depend on the studied protein lineage, the 
versatility of the domains, and methodological factors. A compre- 
hensive and systematic study using more complex phylogenetic 
methods than the fairly ad hoc parsimony approach, as well as 
effective ways to avoid overestimating the frequency of polyphyletic 
evolution due to incorrect domain assignments or hidden homol- 
ogy between different domain families, may be the way to go. At 
this point all that can be said is that polyphyletic evolution of 
domain architectures definitely does happen, but relatively rarely, 
and that it is more frequent for complex architectures and versatile 
domains. A detailed case study was made recently of netrin domain- 
containing proteins, where polyphyletic evolution in metazoa 
seems well-supported [88]; these authors further suggest the 
term merology for such polyphyletic evolution. A series of papers 
by Nagy and Patthy et al. [73, 89, 90] further elaborates on 
challenges faced within this line of research; they report strong 
confounding influence of gene prediction errors. They further 
propose the term epaktology for gene similarity resulting from the 
independent acquisition of two proteins by the same additional 
domain. The authors suggest such cases inflate both estimates of 
terminal domain changes and estimates of gene fusion-driven 
changes in domain architecture. Beyond such changes, whether
correctly inferred or not, the authors describe internal domain 
shuffling as an important mechanism for how domain architecture 
evolution has occurred. 


As access to genomic data and to increasing amounts of compute 
power has grown during the last decade-and-a-half, so has our 
knowledge of the overall patterns of domain architecture evolution. 
Still, no study is better than its underlying assumptions, and differ- 
ences in the representation of data and hypotheses mean that results 
often cannot be directly compared. Overall, however, the current 
state of the field appears to support some broad conclusions. 

Domain and multi-domain family sizes, as well as numbers of 
co-occurring domains, all approximately follow power laws, which 
implies a scale-free hierarchy. This property is associated with many 
biological systems in a variety of ways. In this context, it appears to 
reflect how a relatively small number of highly versatile components 
have been reused again and again in novel combinations to create a 
large part of the domain and domain architecture repertoire of 
organisms. Gene duplication is the most important factor to gen- 
erate multi-domain architectures, and as it outweighs domain 
recombination, only a small fraction of all possible domain combi- 
nations is actually observed. This is probably further modulated by 
family-specific selective pressure, though more work is required to 
demonstrate to what extent. Most of the time, all proteins with the 
same architecture or domain combination stem from a single ances- 
tor where it first arose, but there remains a fraction of cases, 
particularly with domains that have very many combination part- 
ners, where this does not hold. 

Most changes to domain architectures occur following a gene 
duplication and involve the addition of a single domain to either 
protein terminus. The main exceptions to this occur in repeat 
regions. Exon shuffling played an important part in animals by 
introducing a great variety of novel multi-domain architectures, 
reusing ancient domains as well as domains introduced in the 
animal lineage. 

In this chapter, we have reexamined with the most up-to-date 
datasets many of the analyses done previously on less data and 
found that the earlier conclusions still hold true. Even though we 
are on the brink of amassing vastly more genome and
proteome data thanks to the new generation of sequencing tech- 
nology, there is no reason to believe that this will alter the funda- 
mental observations we can make today on domain architecture 
evolution. However, it will permit a more fine-grained analysis, and 
also there will be a greater chance to find rare events, such as 
independent creation of domain architectures. Furthermore, careful
application of more complex models of evolution with and
without selection pressure may allow us to determine more closely 
to what extent the process of domain architecture evolution was 
shaped by selective constraints. 


11 Materials and Methods


Updated statistics were generated from the data in Pfam 30.0. All 
UniProt proteins in the SwissPfam set for Pfam 30.0 were included. 
These span 1090 bacteria, 506 eukaryotes, and 94 archaea. All 
Pfam-A domains regardless of type were included. However, as 
stretches of repeat domains are highly variable, consecutive subse- 
quences of the same domain were collapsed into a single pseudo- 
domain, if it was classified as type Motif or Repeat, as in several 
previous works [50, 60, 66, 82]. 

Domains were ordered within each protein based on their 
sequence start position. In the few cases of domains being inserted 
within other domains, this was represented as the outer domain 
followed by the nested domain, resulting in a linear sequence of 
domain identifiers. As long regions without domain assignments 
are likely to represent the presence of as-yet uncharacterized 
domains, we excluded any protein with unassigned regions longer 
than 50 amino acids (more than 95% of Pfam-A domains are longer 
than this). This approach is similar to that taken in previous works 
[59, 60, 71]. Other studies [50, 72] have instead performed addi- 
tional, more sensitive domain assignment steps, such as clustering 
the unassigned regions to identify unknown domains within them. 

Pfam domains are sometimes organized in clans, where clan- 
mates are considered homologous. A transition from a domain to 
another of the same clan is thus less likely to be a result of domain 
swapping of any kind and more likely to be a result of sequence 
divergence from the same ancestor. Because of this, we replaced all 
Pfam domains that are clan members with the corresponding clan. 
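
The preprocessing rules above (ordering by start position, collapsing runs of repeat domains, excluding proteins with long unassigned regions, and replacing clan members by their clan) can be sketched as follows. This is a minimal illustration and not the Perl and R scripts used for the chapter. The `Hit` record, the function name, and all identifiers in the example are hypothetical, and the choice to collapse repeats before mapping to clans is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    domain: str  # Pfam-A identifier
    start: int   # first residue, 1-based
    end: int     # last residue, inclusive
    kind: str    # Pfam type, e.g. "Domain", "Family", "Repeat", "Motif"


COLLAPSIBLE = {"Repeat", "Motif"}


def build_architecture(hits, protein_length, clan_of=None, max_gap=50):
    """Return the architecture as a tuple, or None if the protein is excluded."""
    clan_of = clan_of or {}
    # Sorting by start puts an outer domain before a domain nested inside it.
    ordered = sorted(hits, key=lambda h: h.start)

    # Exclude proteins with an unassigned stretch longer than max_gap residues.
    covered_to = 0
    for h in ordered:
        if h.start - covered_to - 1 > max_gap:
            return None
        covered_to = max(covered_to, h.end)
    if protein_length - covered_to > max_gap:
        return None

    arch = []
    previous = None
    for h in ordered:
        # Collapse consecutive copies of the same Repeat or Motif domain.
        if previous == h.domain and h.kind in COLLAPSIBLE:
            continue
        # Replace clan members by their clan; other domains keep their own id.
        arch.append(clan_of.get(h.domain, h.domain))
        previous = h.domain
    return tuple(arch)


# Hypothetical protein: two copies of a repeat followed by a clan member.
hits = [
    Hit("PF_REP", 10, 45, "Repeat"),
    Hit("PF_REP", 50, 85, "Repeat"),
    Hit("PF_KIN", 90, 340, "Domain"),
]
clans = {"PF_KIN": "CL_X"}
kept = build_architecture(hits, 360, clans)
dropped = build_architecture(hits, 500, clans)
```

In the second call the protein is excluded because 160 residues after the last domain carry no assignment.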

The statistics and plots were generated using a set of Perl and R 
scripts, which are available upon request. Power law regressions 
were done using the R nls function. For reasons of scale, the 
regression for a power law relation such as 


$$N = cX^{-a}$$

was performed on the equivalent relationship

$$\log(X) = \frac{1}{a}\bigl(\log(c) - \log(N)\bigr)$$

for the parameters a and c, with the exception of the data for Fig. 6,
where instead the relationship

$$\log(N) = \log(c) - a\log(X)$$

was used. Moreover, because species or organism group datasets
were of very different size, raw counts of domains were converted
to frequencies before the regression was performed.

Evolution of Protein Domain Architectures
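
As an illustration of fitting a power law on the logarithmic scale, the sketch below estimates a and c from the relationship log(N) = log(c) - a log(X). It substitutes ordinary least squares in pure Python for the R nls function used in the chapter, so it shows the idea rather than reproducing the original fit. The function name and the synthetic data are assumptions.

```python
import math


def fit_power_law(xs, ns):
    """Least-squares fit of log(N) = log(c) - a*log(X); returns (a, c)."""
    log_x = [math.log(x) for x in xs]
    log_n = [math.log(n) for n in ns]
    mean_x = sum(log_x) / len(log_x)
    mean_n = sum(log_n) / len(log_n)
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxn = sum((x - mean_x) * (n - mean_n) for x, n in zip(log_x, log_n))
    slope = sxn / sxx
    intercept = mean_n - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic counts following N = 1000 * X^-1.5, converted to frequencies
# first, as is done in the chapter for datasets of very different size.
xs = [1, 2, 4, 8, 16]
counts = [1000.0 * x ** -1.5 for x in xs]
total = sum(counts)
freqs = [n / total for n in counts]
a, c = fit_power_law(xs, freqs)
```

Converting counts to frequencies rescales c by the dataset size but leaves the exponent a unchanged, which is what makes exponents comparable across species groups.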


12 Online Domain Database Resources 


For further studies or research into this field, the first and most
important stop will be the domain databases. Table 2 presents a
selection of domain databases in current use.


Table 2 A selection of protein domain databases

Database | URL | Notes | Reference
--- | --- | --- | ---
ADDA | http://ekhidna.biocenter.helsinki.fi/sqgraph/pairsdb | Automatic clustering of protein domain sequences | [11]
CATH | http://www.cathdb.info | Based solely on experimentally determined 3D structures | [2]
CDD | http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml | Meta-database joining together domain assignments from many different sources, as well as some unique domains | [7]
Gene3D | http://gene3d.biochem.ucl.ac.uk | Bioinformatic assignment of sequences to CATH domains using hidden Markov models | [4]
InterPro | http://www.ebi.ac.uk/interpro | Meta-database joining together domain assignments from many different sources | [6]
Pfam | http://pfam.sanger.ac.uk | Domain families are defined from manually curated multiple alignments and represented using hidden Markov models | [5]
ProDom | http://prodom.prabi.fr | Automatically derived domain families from proteins in UniProt | [9]
SCOP | http://scop.mrc-lmb.cam.ac.uk | Based solely on experimentally determined 3D structures | [1]
SMART | http://smart.embl-heidelberg.de | Domain families are defined from manually curated multiple alignments and represented using hidden Markov models | [8]
SUPERFAMILY | http://supfam.cs.bris.ac.uk | Bioinformatic assignment of sequences to SCOP domains using hidden Markov models trained on the sequences of domains in SCOP | [3]
Genome3D | http://genome3d.eu/ | Meta-database joining together domain assignments from many different sources, operating on the architecture level for a set of selected genomes | [12]

13 Domain Architecture Analysis Software


Several software tools have been described and made available that
allow for analysis and visualization of domain architectures and
their evolution. A selection of such tools is shown in Table 3.

A few of these tools allow domain architecture evolution analysis
by visualizing each protein's domain architecture along a protein
sequence tree. An example is the web tool TreeDom [96] which,
given a protein domain family and an anchor sequence, fetches the
family from Pfam and builds a tree with the nearest neighbors of the
anchor sequence. An example output from TreeDom is shown in
Fig. 7, in which a nonredundant set of representative proteomes
was queried. Here one can see that while the NUDIX domain of
the anchor sequence tends to co-occur with two other domains
(zf-NADH-PPase and NUDIX-like), it also has recombined with
many other domains over the course of evolution.

Other tools allow different types of analyses, for instance,
searching for similar domain architectures or showing taxonomic
distributions. Some of the protein domain databases listed in
Table 2 include variants of such analyses, while external tools
typically offer more specialized functionality. For example, the
Pfam website allows searching for domain content, while the Java
tool PfamAlyzer allows searching Pfam for particular domain
architecture patterns specified with a given domain order and
spacing [94].

The RAMPAGE/RADS tools [95] make use of domain assignments
for rapid homology searching. DoMosaics [92] is a software
tool that can act as a wrapper for domain annotation tools, allowing
detailed visualization and analysis of domain architectures, as does
DomArch [97]. The DAAC algorithm [98] explicitly transfers
functional annotation to query sequences based on domain
architectural similarity to annotated homologs, as does FACT [93]. In
the same vein, similarity measures between architectures are
available using the WDAC [99] tool and in ADASS [100]. Domain
architecture similarity is used for orthology detection in the
porthoDom software [68]. The DOGMA tool makes use of
domain content data to assess completeness of a proteome or
transcriptome [101].


Table 3 A selection of online software applying protein domain architecture evolution analysis

Tool | URL | Description | Reference
--- | --- | --- | ---
CDART | https://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi | Searches for proteins with similar domain architecture | [91]
DoMosaics | http://www.domosaics.net/ | Visualizes domain evolution using trees | [92]
FACT | http://fact.cibiv.univie.ac.at/ | Searches for functionally equivalent proteins by scoring domain architecture similarities | [93]
PfamAlyzer | http://pfam.xfam.org/search | Searches Pfam for proteins with specific domain architecture patterns | [94]
RADS/RAMPAGE | http://rads.uni-muenster.de/ | Homology searching by aligning multiple domains instead of residues | [95]
TreeDom | http://treedom.sbc.su.se/ | Graphical web tool for analyzing domain architecture evolution using Pfam | [96]


Fig. 7 TreeDom output using as query the NUDIX domain (PF00293), the human NUDT12 (Q9BQG2) protein,
30 closest sequences, and RP15 (representative proteomes at 15% co-membership). The domains are green,
NUDIX; blue, NUDIX-like (PF09296); yellow, zf-NADH-PPase (PF09297); red, Ocnus (PF05005); cyan, Ank_2
(PF12796); black, Ank_5 (PF13857); orange, Prefoldin (PF02996); and pink, Fibrinogen_C (PF00147)

Exercises/Questions


- Which aspects of domain architecture evolution follow from
  properties of nature's repertoire of mutational mechanisms,
  and which follow from selective constraints?

- What trends have characterized the evolution of domain
  architectures in animals?

- Discuss approaches to handle limited sampling of species with
  completely sequenced genomes. How can one draw general
  conclusions or test the robustness of the results? Apply, e.g., to
  the observed frequency of domain architectures that have
  emerged multiple times independently in a given dataset.

- Describe the principle of "preferential attachment" for evolving
  networks. In what protein domain-related contexts does this
  seem to model the evolutionary process, and what distribution
  of node degrees does it produce?

- What protein properties correlate with domain versatility? Can
  the versatility of a domain be different in different species
  (groups) and change over evolutionary time?

- What protein domain-related properties differ between
  prokaryotes and eukaryotes?
Abstract


Understanding the abundance, diversity, and distribution of TEs in genomes is crucial to understand 
genome structure, function, and evolution. Advances in whole-genome sequencing techniques, as well as 
in bioinformatics tools, have increased our ability to detect and analyze the transposable element content in 
genomes. In addition to reference genomes, we now have access to population datasets in which multiple 
individuals within a species are sequenced. In this chapter, we highlight the recent advances in the study of 
TE population dynamics focusing on fruit flies and humans, which represent two extremes in terms of TE 
abundance, diversity, and activity. We review the most recent methodological approaches applied to the 
study of TE dynamics as well as the new knowledge on host factors involved in the regulation of TE activity. 
In addition to transposition rates, we also focus on TE deletion rates and on the selective forces that affect 
the dynamics of TEs in genomes. 


Key words Long-read sequencing, Transposition rates, Self-regulation, Effective population size, 
Adaptation, Horizontal transfer 


1 Transposable Elements Are Abundant and Active Genome Denizens 


Transposable elements (TEs) are short DNA sequences, typically 
from a few hundred bp to ~10 kb long, which have the ability to 
move around in the genome by generating new copies of them- 
selves. In addition to active autonomous elements, genomes also 
contain nonautonomous elements that can be mobilized by the
enzymatic machinery of active TEs from the same family. Addition- 
ally, genomes contain TEs that cannot be mobilized anymore due 
to accumulation of mutations in their sequences [1]. TEs are an 
ancient, extremely diverse, and exceptionally active component of 
genomes. TEs have been found in virtually all organisms studied so 
far including bacteria, archaea, fungi, protists, plants, and animals 
[2-5]. The main TE groups, class I and class II, are present in all
kingdoms, revealing their persistence over evolutionary time
[2]. These two classes of TEs differ in their transposition
intermediates: while class I TEs transpose through RNA intermediates,
class II TEs transpose directly as DNA. TEs within each class are 
further classified into (1) different orders, based on their insertion 
mechanism, structure, and encoded proteins; (2) different super- 
families, based on their replication strategy and on presence and 
size of target site duplications; and (3) different families, based on 
sequence conservation [2, 3]. Piegu et al. [1] criticized the current 
classification system, which accounts for sequence homology, struc- 
tural features, and target site duplications, because it does not 
always take into account the evolutionary origins of the TEs 
[1-3]. As a consequence, phylogenetically unrelated classes or sub- 
classes of TEs are grouped [1]. Piegu et al. [1] also suggested that a 
more inclusive classification that includes prokaryotic and eukary- 
otic TE classes should be considered. Recently, Arkhipova [6] 
proposed a TE classification system based on the replicative, inte- 
grative, and structural components of TEs, which integrates differ- 
ent aspects of all the existing classification systems [6]. 

TEs constitute a substantial albeit variable (from ~1% to almost 
90%) proportion of genomes [7, 8] (Fig. 1). The identification 
methods, as well as the sequencing and assembly methods, have 
an important effect on TE content estimates [4, 9-11]. In
some cases, the TE-generated fraction of genomes is likely to be 
underestimated because methods for detecting TEs in genomic 
sequences are necessarily biased toward younger and more easily 
recognizable TEs. Indeed, new tools developed in recent years are 
able to identify TEs that remained hidden until now [4, 11]. As an 
example, when the human genome was first sequenced, ~40–45%
of the genome was identifiable TEs, 5% was genes and other func- 
tional sequences (functional RNAs or regulatory regions), and the 
remaining ~50% of the genome had no identifiable origin [12]. de 
Koning et al. [13] using a highly sensitive new strategy named 
P-cloud found that at least 66-69% of the human genome is identi- 
fiable as repetitive sequences, most of them derived from TEs 
[13]. In Drosophila melanogaster, third-generation sequencing 
techniques (3GS) have allowed the detection of 37% more TE 
insertions in chromosome 2L compared to previously available 
short-read sequencing estimates (see below) [14]. In other Dro- 
sophila species such as D. buzzatii, the TE content has also been 
updated from 6% to 11%, thanks to the recent availability of whole- 
genome sequences [15]. 

As mentioned above, TEs are extremely active genomic denizens
that are able to generate mutations of a great diversity of types
[16-21]. TE-induced mutations range from subtle regulatory
mutations to gross genomic rearrangements and often have
phenotypic effects of a complexity that is not achievable by point
mutations (Fig. 2). Among others, TEs can affect the expression of
nearby genes by adding new splice sites, adenylation signals,
promoters, or transcription factor binding sites [22-24]. TEs can
also be targets of epigenetic histone modifications that spread into
adjacent genes affecting their expression [25, 26]. In addition to
transcriptional changes, TEs have been shown to affect translation
regulation when they are transcribed within an mRNA [27-29], to
contribute to protein-coding regions both at the transcript and at
the protein level [30-35], and TE-encoded proteins have been
domesticated and are part of host genes [17, 36-40]. TE excision
can lead to DNA deletions [41], and TE insertion can result in
adding DNA through 3’ and, less frequently, through 5’
transduction [42, 43]. Finally, ectopic recombination between TEs
causes deletions, duplications, and sequence rearrangements. Two
recent studies in the human genome identified 516 chromosome
rearrangements potentially generated by LINE-LINE nonallelic
homologous recombination and 78 HERV-mediated rearrangements
[44, 45]. Both studies used the annotations of LINEs and
HERVs in the reference genome and looked for evidence of
rearrangements induced by these TEs using clinical databases of copy
number variants containing information from thousands of
patients. In addition to TE-induced mutations associated with
diseases [24, 46-49], the number of TE-induced mutations associated
with positive effects on fitness-related traits also continues to
increase both in humans and in Drosophila [50-63].


Fig. 1 TE content in the genome of different organisms expressed as percentage of the genome: Homo sapiens
(~45% [12], >66% [13]), Mus musculus [143], Saccharomyces cerevisiae [144], Arabidopsis thaliana [145],
Pyrococcus furiosus [146], Clostridium difficile [147], Danio rerio [133], Kryptolebias marmoratus [148],
Bombyx mori [149], Hypothenemus hampei [150], Drosophila melanogaster (11% [68], ~20% [69]),
Pseudozyma antarctica, Laccaria bicolor [151], Zea mays [152], and Fritillaria imperialis [8]. All estimates were
obtained with homology-based methods except [13], which uses P-cloud, and [69], which uses de novo approaches


Fig. 2 Effects of TEs on the host genome. (a) TEs can affect the expression and/or structure of genes. Exons
are represented as blue boxes and TEs as green boxes. (1) A TE inserted in the upstream region of a gene can
add insulator sequences or transcription factor-binding sites (TFBS), or can disrupt an existing promoter;
(2) A TE inserted in an intron can truncate the mRNA or induce alternative splicing; (3) A TE inserted in the
downstream region of a gene can add microRNA binding sites or alter the polyadenylation site; (4) A TE
inserted in the exon of a gene can lead to exonization of the TE or to transcript truncation;
(5) the whole domain of a TE protein could insert in the coding region of a gene generating a chimeric gene
with host and TE domains [5, 21]. In addition to these changes that depend on where the TE is inserted and on
the sequences that the TE is adding, TEs can also alter the posttranslational modifications of histones. (b) TEs
could also induce translation repression by generating secondary structure in the 3’ UTR of genes that leads to
changes in the localization of the mRNA. This secondary structure could bind to one of the protein components
of paraspeckle (P54nrb) and translocate to paraspeckle, a group of subnuclear bodies, avoiding moving out of
the nucleus. However, the same secondary structure could bind to the dsRNA-binding protein Staufen 1
(STAU1) and in this case translocate to cytoplasm. Once in the cytoplasm, the secondary structure could bind
to STAU1 again allowing translation, but under some situations mRNA could bind to the dsRNA-dependent
protein kinase (PKR) repressing translation [23]. (c) Ectopic recombination between TE copies (green boxes
with yellow arrows) in the same orientation can lead to deletions when recombination takes place between
copies located on the same chromatid (1) or deletions and duplications when recombination takes place
between copies in different chromosomes (2) (recombination between two nonhomologous chromosomes
should lead to a translocation). Ectopic recombination between TE copies in opposite orientation leads to
inversion of the DNA between the two TEs (3)


Overall, recent advances in sequencing technologies and in TE
detection methods showed that, as expected, the TE content is
higher than previously estimated. These new data also provided
further evidence for the impact of TEs in genome function and
genome structure. Thus, it is still indisputable that a thorough
understanding of TE population dynamics is essential for the
understanding of the eukaryotic genome structure, function, and
evolution.


2 Drosophila and Humans: Two Extremes in TE Diversity and Population Dynamics


Much of the detailed information on TE evolution still comes from
two species with the best-studied genomes: fruit flies
(D. melanogaster) and humans. Fortunately, these two genomes
represent two extremes in terms of TE diversity and population
dynamics and thus give a reasonably diverse picture of the TE
evolution and dynamics. For the rest of this chapter, we focus
primarily on these two genomes and will highlight the similarities
and differences observed between them.

Lain Guio and Josefa Gonzalez

As mentioned above, the human reference genome has millions
of TE copies, with 66-69% of the genome mostly derived from TE
sequences [13]. Two human retrotransposable element (class I)
families, LINE1 (L1, long interspersed nuclear element 1) and
Alu, account for 60% of all interspersed repeat sequences. The
vast majority of the TEs in the human genome are fixed, and
most families are inactive. However, some elements of the main
families of human endogenous retrovirus (HERV-K) and LINE1
elements show autonomous transposition. Meanwhile, elements of
Alu and the hybrid SVA elements formed by SINEs (short
interspersed nuclear elements), VNTRs (variable number tandem
repeats), and Alus show nonautonomous activity [64-66].

In contrast, the fruit fly D. melanogaster reference genome
contains only thousands of individual TE copies (5416 TE copies
in FlyBase R6.04) accounting for only ~5.5% of the euchromatin
[67]. If the missing percentage of TEs detected in chromosome 2L
is similar in other chromosomes, the euchromatin TE content
might be higher (~8.7%) [14]. If heterochromatin is also included,
TEs account for 11-20% of the D. melanogaster genome
[68, 69]. D. melanogaster TEs belong to approximately 100 diverse
families of both class I and class II elements [69, 70]. Each family
consists of 1-304 copies with no dominant family corresponding to
the majority of TEs. The only exception is the INE-1 family, which
contains ~2000 copies and has been inactive for the past
~3-4.6 million years [71-73]. The majority of TE families are
considered to be active in Drosophila: individual TE copies are generally
polymorphic in the population and show a high sequence similarity
[69, 70, 74, 75]. Indeed, there is experimental evidence showing
that Gypsy and ZAM elements are active [76, 77]. In addition, there is
indirect evidence for the activity of 24 D. melanogaster
superfamilies based on a whole-genome sequencing experiment of mutation
accumulation lines [75] (Table 1).

Why do these two genomes differ so profoundly in content, 
diversity, and activity of TEs? The answer must lie in different 
aspects of TE population dynamics within genomes and forces 
that lead to varying rates of TE family birth and extinction. In the 
rest of this review, we focus on the state of knowledge of different 
aspects of TE population dynamics and discuss aspects of TE family 
evolution. Specifically, we focus on rates of TE transposition, fixa- 
tion, or loss in human and D. melanogaster populations due to 
stochastic forces and natural selection for or against TE insertions 
and forces that affect coexistence of multiple TE families and the 
standing diversity of TE types (Fig. 3). 


New Insights on the Evolution of Genome Content: Population Dynamics of... 


Table 1
Summary of recent TE population dynamic studies

Each entry lists: Objectives; Findings; Relevance for TE dynamics; References.

Reference [151]
Objectives: Overview of new discoveries about TEs in 75 basidiomycete
fungi genomes
Findings: TE content varies among species displaying different
lifestyles from 0.1% to 45.2%. The correlation between TE content and
genome size is not strong. TEs seem essential for chromosomal
architecture. A large battery of mechanisms to avoid transposition is
present
Relevance for TE dynamics: The result of most TE activity is likely
neutral as they often insert in intergenic regions. However, TEs play
an important role in the evolution of plant pathogens and probably in
symbiotic species

Reference [148]
Objectives: Characterization of TE content in the only selfing
hermaphroditic vertebrate: the mangrove killifish Kryptolebias
marmoratus
Findings: TE content is 27%. There is a great diversity of families
with a pronounced abundance of Helitrons compared to its closest
phylogenetic relatives. TE sequence divergence is also higher in
K. marmoratus compared to close species
Relevance for TE dynamics: Against expectations, the number and
composition of TEs in these selfing organisms is comparable to that of
many other fish with outcrossing mating systems. The high Helitron
content is one of the factors that could explain the high genetic
diversity observed in this selfing killifish

Reference [134]
Objectives: Testing whether genome size equilibrium observed in
10 mammals and 24 bird species is due to covariation between DNA gain
by transposition and DNA loss by deletion
Findings: DNA gain varies by more than sixfold across mammals and
30-fold across birds. DNA loss varies by twofold in mammals and
threefold in birds. Neither DNA gain nor loss can solely explain
variation in genome size. DNA loss exceeded gain in all but two
lineages. Midsize deletions (31 bp to 10 kb) play a larger role than
microdeletions (1-30 bp) in DNA loss
Relevance for TE dynamics: Genome size equilibrium is maintained
through DNA loss counteracting DNA gains through TE expansions. DNA
loss has probably been driven by large deletions (>10 kb). Genome
expansion via transposition could promote genome contraction through
TE-mediated deletions

Reference [153]
Objectives: Understanding the differences in abundance and diversity
of L1 elements across vertebrates
Findings: Vertebrate L1s differ in the length of the 5' UTR, 3' UTR,
and intergenic regions. They also differ in base composition with
mammals and lizards showing a stronger A bias on the positive strand
than frog and fish
Relevance for TE dynamics: Mammals show very little 5' UTR homology
due to the frequent acquisition of novel nonhomologous 5' UTR during
evolution. This seems not to occur in other groups of vertebrates
since the relative conservation of the 5' UTR and ORF1 suggests that
the host does not repress transposition in a sequence-specific way

Reference [75]
Objectives: Understanding the role of TEs in D. melanogaster genome
evolution, by estimating their insertion and deletion rates
Findings: 24 TE superfamilies are active in mutation accumulation
lines. TE activity is background dependent. There is an association
between activity of some TE families and chromatin state, as well as a
weak correlation between insertion activity and GC content, and a
negative correlation between deletion activity and exon content
Relevance for TE dynamics: Insertion rate is higher than deletion
rate, which helps explain the relative stability of TE numbers and
genome size in Drosophila in the face of previously reported deletion
bias. Heterochromatin may play a bigger role than recombination in
shaping TE accumulation

Reference [150]
Objectives: Characterization and description of TEs in the coffee
berry borer Hypothenemus hampei genome
Findings: 8.3% of the genome are TEs (880 TE sequences): 49.24% of the
TEs are MITEs. Several new families described: Hypo belonging to Gypsy
superfamily, Hamp a new non-LTR family and rosa a new DNA TE family
Relevance for TE dynamics: Low TE content, compared with other
insects, could be related to the reproductive characteristics and the
population size of this species. Males have a chromosome set not
transmitted to the next generation like asexual populations. The
colonization of America probably produced a founder effect

Reference [154]
Objectives: To develop a comprehensive assessment of transposition
activity at the A. thaliana species level
Findings: The analysis includes 211 samples collected all over the
world. 165 of the 326 families annotated in A. thaliana showed recent
transposition activity at the species level. TE composition and
activity are strongly affected both by environmental and genetic
factors
Relevance for TE dynamics: TEs have pervasive effects on the
expression and methylation status of nearby genes which are likely
deleterious and could help explain why bursts of transposition were
not detected. Its self-fertilizing mating system should also lead to
accelerated elimination of deleterious TE insertions. TEs are also
involved in the generation of large-effect alleles at adaptive trait
loci

Reference [155]
Objectives: Characterization of TE presence/absence in 216 A. thaliana
accessions with respect to the reference genome
Findings: TE deletions were biased toward pericentromeric regions,
while TE insertions had a more uniform distribution over chromosomes.
TE variants associated with changes in nearby gene expression and
local and distal methylation patterns
Relevance for TE dynamics: TEs are a significant source of genetic
variation. Most TEs present at low frequencies. TEs likely play a role
in facilitating epigenomic and transcriptional differences between
A. thaliana accessions

Reference [156]
Objectives: To understand the role of TE in genome evolution of the
sweet potato Ipomoea batatas
Findings: 1405 TEs described based on transcriptomic data. 417 TEs are
expressed in one or more tissues and 107 in the seven tissues analyzed
Relevance for TE dynamics: TE activity is tissue- and
background-specific. Although several TEs are expressed in all the
tissues and strains analyzed, some of them are active only in one
specific strain and/or tissue. Authors suggest that TEs may play a
role in environmental adaptation


3 Methodology Used to Study TE Population Dynamics


TE dynamics continues to be studied using three main approaches:
mathematical modeling, computer simulations, and the analysis of
empirical data. Often a combination of these approaches is used to
better understand TE abundance, diversity, and distribution
(Table 2). Le Rouzic et al. [78] applied the statistical framework
originally developed to infer speciation and extinction dynamics in
species phylogenies to reconstruct the evolutionary history of TEs.
The model makes it possible to estimate and interpret the pattern of
transposition activity that results in different TE copy number
distributions [78]. The authors also performed computer simulations
to provide reference dynamics that aid in the interpretation of
the results obtained (Table 2).
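
Speciation-extinction frameworks of this kind rest on birth-death processes, in which each lineage (here, each TE copy) duplicates or is lost at some rate. The following is only a minimal sketch of such a process, not the inference method of Le Rouzic et al. [78]: the function names, the per-copy rates, and the time span are illustrative assumptions, and the simulation ignores tree topology entirely.

```python
import math
import random


def simulate_copy_number(n0, transposition_rate, loss_rate, t_max, rng):
    """Gillespie simulation of a linear birth-death process.

    Each copy transposes (birth) or is lost (death) independently.
    Rates are per copy per unit time; all values are hypothetical.
    """
    t, n = 0.0, n0
    while n > 0:
        total_rate = n * (transposition_rate + loss_rate)
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > t_max:
            break
        if rng.random() < transposition_rate / (transposition_rate + loss_rate):
            n += 1  # a new copy is inserted
        else:
            n -= 1  # an existing copy is lost
    return n


def expected_copy_number(n0, transposition_rate, loss_rate, t):
    """Mean of the linear birth-death process: n0 * exp((u - v) * t)."""
    return n0 * math.exp((transposition_rate - loss_rate) * t)


rng = random.Random(42)
runs = [simulate_copy_number(10, 0.02, 0.01, 100.0, rng) for _ in range(2000)]
mean_copies = sum(runs) / len(runs)
theory = expected_copy_number(10, 0.02, 0.01, 100.0)
print(f"simulated mean: {mean_copies:.1f}, expected: {theory:.1f}")
```

Individual runs vary widely around the expectation, which is one reason reference simulations are useful when interpreting an observed copy number distribution.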

Fig. 3 Factors that influence the population and evolutionary dynamics of TEs. Our understanding of TE
population and evolutionary dynamics is still incomplete. The different factors that affect TE population and
evolutionary dynamics are interrelated, new factors have been identified in recent years, and future research
is still likely to reveal the existence of additional factors


Traditionally, mathematical models considered the relationship
between the host and a homogeneous group of active TEs. However,
the TE content of any genome is a mix of autonomous and
nonautonomous insertions. Xue and Goldenfeld [79] proposed a
mathematical model that considers the relationship between
nonautonomous and autonomous TEs as a predator-prey dynamic.
Unlike previous models that also use the analogy to ecological
models, the Xue and Goldenfeld model takes into account the
molecular-level interactions between transposable elements and the
small copy number of the active transposons. The model predicts
oscillations in the number of TEs on a time scale much longer than
the cell replication time, suggesting that the genome stores the
predator-prey state during successive generations [79].
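
To illustrate what a predator-prey dynamic looks like, the sketch below integrates the classical deterministic Lotka-Volterra equations. It is not the Xue and Goldenfeld model [79], which is stochastic and built on molecular-level interactions. Treating autonomous copies as the "prey" and nonautonomous copies as the "predators" is an assumption made here for illustration, as are all parameter values and function names.

```python
def derivatives(autonomous, nonautonomous, growth, capture, conversion, decay):
    """Classical Lotka-Volterra rates of change (hypothetical parameters)."""
    d_auto = growth * autonomous - capture * autonomous * nonautonomous
    d_non = conversion * autonomous * nonautonomous - decay * nonautonomous
    return d_auto, d_non


def rk4_step(a, n, dt, params):
    """Advance both copy numbers by one fourth-order Runge-Kutta step."""
    k1 = derivatives(a, n, *params)
    k2 = derivatives(a + 0.5 * dt * k1[0], n + 0.5 * dt * k1[1], *params)
    k3 = derivatives(a + 0.5 * dt * k2[0], n + 0.5 * dt * k2[1], *params)
    k4 = derivatives(a + dt * k3[0], n + dt * k3[1], *params)
    a_next = a + dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
    n_next = n + dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
    return a_next, n_next


def simulate(a0, n0, params, dt=0.01, steps=5000):
    """Return the trajectory as a list of (autonomous, nonautonomous) pairs."""
    a, n = a0, n0
    trajectory = [(a, n)]
    for _ in range(steps):
        a, n = rk4_step(a, n, dt, params)
        trajectory.append((a, n))
    return trajectory


def count_peaks(values):
    """Count strict local maxima, a simple indicator of oscillation."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )


# growth, capture, conversion, decay: illustrative values only
PARAMS = (1.0, 0.1, 0.05, 0.5)
trajectory = simulate(10.0, 5.0, PARAMS)
autonomous_series = [a for a, _ in trajectory]
nonautonomous_series = [n for _, n in trajectory]
print("peaks in autonomous copy number:", count_peaks(autonomous_series))
```

Both copy numbers cycle without settling, with the nonautonomous peak lagging behind the autonomous one. This is the qualitative signature that the ecological analogy refers to.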

TE dynamics have also been analyzed in variable environments
[80, 81] (Table 2). Gogolesky et al. [81] proposed a stochastic
computational model to analyze the dynamics of active TEs in
genomes of sexual diploid organisms under environmental stress.
They based their model on the Fisher geometrical model of fitness
landscapes. Overall, the authors conclude that the presence of
inactive copies of TEs is necessary for the transposition-selection
equilibrium of autonomous copies and that the mutator capacity of
TEs might be important when host populations face rapid
environmental changes [81].
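
The intuition behind Fisher's geometric model can be sketched in a few lines: fitness declines with the distance between a phenotype and an optimum, so random mutations are rarely beneficial for a well-adapted population and more often beneficial after the optimum shifts. The code below is a generic illustration under assumed values (Gaussian fitness, five traits, fixed mutation size). It does not reproduce the models of [80] or [81], and treating a TE insertion as a random phenotypic displacement is a simplification introduced here.

```python
import math
import random


def fitness(phenotype, optimum):
    """Gaussian fitness: declines with squared distance from the optimum."""
    d2 = sum((p - o) ** 2 for p, o in zip(phenotype, optimum))
    return math.exp(-0.5 * d2)


def random_mutation(n_traits, size, rng):
    """Displacement of fixed magnitude in a uniformly random direction."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n_traits)]
    norm = math.sqrt(sum(x * x for x in v))
    return [size * x / norm for x in v]


def fraction_beneficial(phenotype, optimum, size, trials, rng):
    """Monte Carlo estimate of the share of mutations that raise fitness."""
    baseline = fitness(phenotype, optimum)
    hits = 0
    for _ in range(trials):
        step = random_mutation(len(phenotype), size, rng)
        mutant = [p + s for p, s in zip(phenotype, step)]
        if fitness(mutant, optimum) > baseline:
            hits += 1
    return hits / trials


rng = random.Random(1)
phenotype = [0.2, 0.0, 0.0, 0.0, 0.0]        # hypothetical population phenotype
old_optimum = [0.0, 0.0, 0.0, 0.0, 0.0]      # population is close to this optimum
new_optimum = [2.2, 0.0, 0.0, 0.0, 0.0]      # optimum after an environmental shift

adapted = fraction_beneficial(phenotype, old_optimum, 0.2, 20000, rng)
shifted = fraction_beneficial(phenotype, new_optimum, 0.2, 20000, rng)
print(f"beneficial near optimum: {adapted:.3f}, after shift: {shifted:.3f}")
```

Under these assumptions, the share of beneficial mutations rises after the shift, which is consistent with the idea that mutagenic activity may matter most when the environment changes rapidly.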

Other recently developed methods analyzed the influence of
the mating system on TE dynamics, explored different modes of
selection, or applied branching models to study the propagation of
particular TE classes [82-84] (Table 2).

Table 2
Summary of recent mathematical models and computer simulations applied to the study of TE
dynamics

Each entry lists: Model description; TEs modelled; Conclusions; References.

Reference [78]
Model description: The model quantifies the transposition activity
over time based on the distribution of transposition events in the
phylogenetic tree and the tree topology
TEs modelled: Fot subfamilies from Fusarium oxysporum
Conclusions: The four subfamilies analyzed are still active with two
of them showing clear changes in their transposition dynamics. The
results obtained showed that regulation of transposition by the number
of copies is not strong enough to maintain stable
transposition-deletion equilibrium

Reference [79]
Model description: Considering the genome as an ecosystem, the model
analyzes the interaction between nonautonomous and autonomous TEs as a
predator-prey relationship in individual cells
TEs modelled: L1 and Alus from Homo sapiens
Conclusions: The model predicts oscillations in the number of TEs in a
time scale much longer than the cell replication time. Thus, the
genome stores the predator-prey state during successive generations

Reference [80]
Model description: The model, based on the Fisher geometric model,
analyzes TE dynamics under changing environments in clonal organisms
TEs modelled: Autonomous and nonautonomous TEs in asexual population
Conclusions: The model predicts that when nonautonomous TE copies are
present, the transposition activity is lost and thus the stability of
the host-TE system is compromised. Changes in the environment may
induce bursts of transposition activity associated with faster
adaptation. However, it is unlikely that the transposition activity is
maintained in the long term

Reference [81]
Model description: The model, based on the Fisher geometrical model,
analyzes TE dynamics in sexual diploid organisms under environmental
changes
TEs modelled: TEs in sexual diploid populations
Conclusions: The model suggests that the presence of inactive copies
of TEs is necessary for the transposition-selection equilibrium of
active copies and that the mutagenic role of TEs is crucial when host
populations face rapid environmental changes

Reference [102]
Model description: The model, based on the selfish DNA theory,
analyzes the invasion dynamics of active TEs during the first stages
of an experimental evolution experiment
TEs modelled: Mos1 and peach, mariner family from Drosophila
melanogaster
Conclusions: The model predicts lower invasion frequencies than the
ones observed experimentally. A substantial rate of replicative
transposition during the initial invasion of the element was inferred
from the discrepancy between observed and theoretical copy numbers

Reference [82]
Model description: The model analyzes the impact of intermediate
selfing rates on TE dynamics and the influence of the mating system on
the evolutionary properties of TEs
TEs modelled: Active TEs in a diploid hermaphrodite population
Conclusions: The model predicts that the efficiency of TEs as genomic
parasites decreases with the selfing rate, although rare TE invasions
can still occur even in populations with 90% selfers. The model
predicts TE extinction if populations change from sexual to asexual
reproduction, although empirical data do not strongly support this
result

Reference [83]
Model description: The model studies the evolutionary behavior of TE
copy number and the molecular evolution of their DNA sequences
TEs modelled: TEs in sexual diploid populations
Conclusions: The model predicts that weak selection allows high copy
numbers of TEs, most of them inactive copies, while strong selection
reduces the number of TEs but increases the proportion of active
copies. Regarding TE sequences, the model shows that the phylogeny of
these sequences allows distinguishing active copies from non- and less
active copies

Reference [84]
Model description: The model analyzes the propagation of LTR TEs by
taking into account the TE position in the chromosome, the degradation
level of the TEs, and the duplication rate that varies with the
degradation level
TEs modelled: roo, Gypsy and DM412, TEs of LTR family from Drosophila
melanogaster
Conclusions: The simulation estimates several parameters affecting the
propagation of TEs and identifies the initial copy from which three
LTR families have spread on the euchromatin part of the 3L chromosome


In addition to mathematical modeling and simulations, multiple
computational tools have been developed to analyze TEs in
sequenced genomes in the last 5 years. While some of these tools
aim at assessing the global abundance and diversity of TEs in the
genome, such as dnaPipeTE, or at annotating TEs in assembled
genomes, such as REPET, most of them focus on discovering
and/or genotyping individual copies of TEs in the genome using
next-generation sequencing (NGS) data [11, 64, 85-90]. The
diversity of methods available makes it difficult to choose the
most appropriate one for the analysis of a given genome. To try
to overcome this limitation, Nelson et al. [91] developed an
integrated pipeline named McClintock that incorporates six
complementary TE detection methods. McClintock generates
standardized output for the different TE detection methods, thus
facilitating the comparison of the results obtained with the different
pipelines, as well as facilitating their installation and use [91]. This
and other studies that compared the performance of several tools
arrived at the same conclusion: several computational tools should
be combined to increase the accuracy of TE analysis [64, 86, 91].
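
One simple way to combine several detection tools is to keep only the insertions that independent tools agree on. The sketch below is a hypothetical illustration of that idea and is not McClintock's algorithm or output format. The tuple layout `(chromosome, position, family)`, the tool names, the 100 bp window, and the two-tool threshold are all assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import median_low


def summarize_cluster(chrom, family, cluster, min_tools):
    """Return a consensus call if enough distinct tools support the cluster."""
    tools = sorted({tool for _, tool in cluster})
    if len(tools) < min_tools:
        return None
    position = median_low([pos for pos, _ in cluster])
    return {"chrom": chrom, "family": family, "position": position, "tools": tools}


def consensus_insertions(calls_by_tool, window=100, min_tools=2):
    """Merge nearby calls of the same family and keep multi-tool clusters.

    Calls on the same chromosome and TE family are chained into one
    cluster while consecutive positions lie within `window` bp.
    """
    grouped = defaultdict(list)
    for tool, calls in calls_by_tool.items():
        for chrom, pos, family in calls:
            grouped[(chrom, family)].append((pos, tool))

    consensus = []
    for (chrom, family), hits in sorted(grouped.items()):
        hits.sort()
        cluster = [hits[0]]
        for hit in hits[1:]:
            if hit[0] - cluster[-1][0] <= window:
                cluster.append(hit)
            else:
                summary = summarize_cluster(chrom, family, cluster, min_tools)
                if summary is not None:
                    consensus.append(summary)
                cluster = [hit]
        summary = summarize_cluster(chrom, family, cluster, min_tools)
        if summary is not None:
            consensus.append(summary)
    return consensus


# Hypothetical calls from three unnamed tools
calls = {
    "tool_a": [("2L", 1000, "roo"), ("2L", 50000, "gypsy")],
    "tool_b": [("2L", 1040, "roo"), ("3R", 700, "roo")],
    "tool_c": [("2L", 980, "roo"), ("2L", 50300, "gypsy")],
}
consensus_calls = consensus_insertions(calls)
for call in consensus_calls:
    print(call)
```

In this example only the roo insertion near position 1000 on 2L is retained, because it is the only call reported by more than one tool within the window. Raising `min_tools` trades sensitivity for specificity.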

The availability of third-generation sequencing (3GS) techniques
should help improve the detection and genotyping of TE
insertions. Although 3GS was developed before 2010 [92], it is
only in the last few years that this technique has started to be
used [14, 93]. Chakraborty et al. [14] reported the assembly of a
D. melanogaster genome from a Zimbabwe strain using long-read
single-molecule real-time sequencing with 147X coverage. Among
several novel structural variants described, they identified 37%
additional TE insertions in the 2L chromosome compared with a
previous study that used 70X coverage of short reads [14, 94]. 3GS
technologies have also been applied to the sequencing of human
genomes, although a detailed analysis of TE content based on
long-read data has not been performed yet [95-97].

Recently, Disdero and Filée [98] introduced LoRTE, the first tool that
uses long-read sequences to identify TE insertions in the
D. melanogaster genome. The authors argue that
available software based on short reads fails to correctly identify
TEs that are present in highly repetitive regions of the genome,
while long-read technologies should allow us to identify all TEs in a
given genome. LoRTE, developed in Python, verifies the presence
and/or absence of previously annotated TEs and can also detect
new insertions not previously annotated in the reference genome.
LoRTE is able to work with low-coverage sequences (<10X),
providing an efficient and accurate TE annotation in a cost-effective
manner [98].


4 Rates of Transposition 


4.1 Empirical Estimates of the Rates of Transposition in Drosophila and Humans


Transposition rates in D. melanogaster have been traditionally
estimated empirically by in situ hybridization and by using PCR
approaches. The activation of TEs following intra- and interspecific
hybridization has been studied in different Drosophila species
[99-101]. For example, Vela et al. [100] estimated transposition
rates in D. buzzatii-D. koepferae interspecific hybrid flies by in situ
hybridization. They found that hybrids showed at least one
order of magnitude higher transposition rates than parental lines
for at least three TE families [100]. Robillard et al. [102] estimated
transposition rates by qPCR in an experimental evolution study in
which a TE insertion was introduced in a strain lacking insertions
from that particular family. In the first generations after the
introduction of the TE insertion, the transposition rate was
0.33-0.45 per copy per generation, while in the following
generations, transposition rates were reduced by at least one order of
magnitude. These values represent the first steps
in the invasion of a TE in a genome, which is faster than the rate of
transposition measured in natural populations [102].

In the first edition of this chapter [103], we anticipated that
NGS would allow studying transposition rates in a deeper and more
accurate way. Indeed, recent studies have taken advantage of NGS
data to estimate transposition rates in D. melanogaster. Rahman
et al. [89] used NGS data to estimate the transposition rate in the
reference strain by comparing two available genomes that were
sequenced ~15 years apart. The average transposition
rate for TEs belonging to different families was 7 x 10 °, which
is on the same order of magnitude as the previously reported rates
(~10-*-10~°). Furthermore, they confirmed the prediction of
increased transposition rate in inbred lines: they estimated a higher
average number of TE insertions in lab strains inbred for more
generations compared with strains inbred for a smaller number of
generations [89]. Adrion et al. [75] estimated spontaneous
insertion and deletion rates in D. melanogaster mutation accumulation
lines. The authors identified 24 active superfamilies and
estimated genome-wide insertion rates to be higher than deletion
rates: 2.11 × 10⁻⁹ vs. 1.37 × 10⁻¹⁰ per site per generation,
respectively. Superfamily-specific rates of insertion varied from
0 to 5.13 x 107° insertions per copy per generation and were
within the range of previously estimated rates [75] (Table 1).

In humans, previous studies estimated the transposition rate as
1 in 95 to 1 in 250 births for L1, 1 in 20 births for Alu insertions,
and 1 in 916 births for SVA retrotransposons [104-107]. Although
there are several recent studies that estimate transposition rate in
humans using NGS data, they all focused on somatic transposition
in the brain or in tumor samples [47, 48, 90].


4.2 Transposition Control Mechanisms


Understanding the mechanisms controlling the transposition of
TEs is central to our understanding of TE dynamics. Many
different mechanisms of TE regulation have been described [43, 108,
109]. In this section, we will highlight recent advances in both TE
self-regulation and regulation by host factors.


4.2.1 TE Self-Regulation


Self-regulation of transposition was first described in prokaryotes
and soon after in TEs involved in hybrid dysgenesis in Drosophila
[110]. Recent studies have cast some doubt on one of the
self-regulation mechanisms described: transposase overproduction
inhibition. The transposase overproduction inhibition mechanism
regulates the transposition of the IS630-Tc1-mariner, piggyBac, and
hobo-Ac-Tam (hAT) superfamilies [111, 112]. However, several
studies reported contradictory results suggesting that transposase
inhibition by overproduction does not always happen [113]. Bire
et al. [113] suggested that some works failed to detect transposase
inhibition because cellular cofactors are necessary to execute this
regulation system, and as such it can only be detected in in vivo
experiments. However, Woodard et al. [114] showed that
aggregation of transposase proteins produces filamentous
structures (rodlets) in the nucleus in a host-independent manner.
The authors further showed that a decline in transposition
occurs after transposase concentrations are high enough for
filamentous structures to be visible [114]. Thus, it is still not clear why
some in vitro experiments failed to detect transposase
overproduction inhibition [114].


4.2.2 Regulation by Host Factors


Small RNAs, such as small-interfering RNAs (siRNAs) and
piwi-interacting RNAs (piRNAs), are well known to play an essential
role in silencing TEs and preventing transposition. Several recent
reviews highlight the monumental progress in this field
[115-119]. In addition to posttranscriptional regulation of TEs,
small RNAs are involved in transcriptional regulation as well. In
mouse, piRNAs are required for de novo methylation and silencing
of TEs [120]. In Drosophila, Piwi proteins repress transcription and
correlate with an increase in repressive chromatin marks at loci
targeted by piRNAs [121].

While the role of siRNAs and piRNAs has been established for
several years, a role of microRNAs (miRs) in suppressing the
mobility of retrotransposons was only recently described
[122]. The authors showed that miR-128 binds to L1 RNA and
represses its integration in humans [122].

New studies have also provided evidence for the role in TE
repression of proteins previously known for their roles in other
cellular processes such as interferon-stimulated proteins, the
tumor suppressor p53, and the longevity regulating protein
SIRT6. Several interferon-stimulated genes, such as the Moloney
leukemia virus 10 (MOV10), the zinc-finger antiviral protein
(ZAP), and the 3' repair exonuclease 1 (TREX1), which are
associated with virus response, have recently been implicated in the
inhibition of L1 activity [66, 123]. Recently, it has also been
shown that the p53 transcription factor, which is involved in stress
response networks and acts to restrict oncogenesis, also restricts
retrotransposon activity in zebrafish, flies, and humans [124]. The
authors showed that p53 interacts with components of the
piwi-interacting RNA (piRNA) pathway to suppress retrotransposition
[124]. Finally, the longevity regulating protein SIRT6 is also involved
in retrotransposon repression by coordinating their packaging into
transcriptionally repressive heterochromatin. SIRT6 binds to the 5' UTR
region of retrotransposons and mono-ADP ribosylates the
Krüppel-associated protein 1 (KAP1), facilitating the interaction of
KAP1 with the heterochromatin protein 1α (HP1α) and leading to
chromatin compaction [125].


5 Rate of Fixation and Frequency Distribution 


5.1 Natural Selection Against TE Insertions


Natural selection and stochastic processes influence both the rate of
fixation and the frequency distribution of TEs in populations. The
efficiency of selection depends on the effective population size,
which largely differs between Drosophila and humans: >10⁶ and
~10⁴, respectively [126, 127]. Thus, while in Drosophila the high
efficiency of selection should lead to the removal of slightly
deleterious TE insertions, in humans, these insertions may accumulate in
the genome. Indeed, most of the TE sequences in the human
genome are remnants of ancient insertions [12].

A review by Barron et al. [128] explored the latest insights on
the nature of selection acting against the deleterious effects of TEs
in D. melanogaster populations. More recently, Kofler et al.
[129] analyzed intraspecific TE dynamics in D. melanogaster
and D. simulans populations to shed light on the long-term
evolution of TEs. They confirmed that most of the TEs are present
at low frequencies in D. melanogaster and showed that the same
pattern is present in D. simulans. Based on computer simulations
showing that 50% of the TE families have temporally
heterogeneous transposition rates, and on the differences in TE composition
between populations of the same species, the authors suggested
that TE activity has recently increased in the two species. They
proposed that the demographic history of both species, with a
recent colonization of different environments, could be the cause
of the high TE activity detected [129].

In humans, a recent study took advantage of the 1000 Genomes
Project data, which report 16,192 polymorphic TEs, to perform the
most complete TE dynamics analysis to date [130]. Most of the
polymorphic TEs were found to be present at very low frequencies:
>93% of TEs showed <5% allele frequency in 26 human
populations. These results confirm that overall polymorphic TE insertions
are deleterious in humans, as was previously suggested with smaller
family-specific datasets [131].


5.2 TE-Induced Adaptations


Several recent reviews have compiled results that showcase the
adaptive role of TEs [19, 24, 50, 59, 128]. We would like to
highlight the recent discovery of a TE in a fish-like marine chordate
that encodes RAG-like proteins with endonuclease-transposase
activity [39]. This discovery provides evidence that supports the
TE origin hypothesis for the adaptive immune system in jawed
vertebrates [39]. Two other recent publications provide
experimental evidence for a role of TEs as providers of functional
transcription factor binding sites (TFBS) involved in immune
response and in cell pluripotency [50, 132]. A recent study linked
ERV elements in humans with the interferon response pathway
[50]. The authors showed that ERVs carrying enhancers have
been co-opted to activate different genes involved in the inflammatory
response activated by interferon. This example shows how the
exaptation of one family of TEs could shape a transcriptional
network to activate different genes with one trigger system [50].
Sundaram et al. [132] reported mouse-specific TEs that contain
multiple transcription factor binding sites for pluripotency
transcription factors. The majority of the TEs were experimentally
shown to exhibit enhancer activity in mouse embryonic stem cells,
including an in silico reconstructed ancestral TE. This latter result
suggests that ancestral TEs already had transcriptional regulatory
sites [132].

In Drosophila, the adaptive role of several TEs has also been
identified. Most of the TEs characterized so far are involved in
stress response: viral infection and xenobiotics (Doc1420,
[60, 61]), oxidative stress (FBti0018880, [53]), xenobiotic stress
(Accord, [62, 63], and FBti0019627, [52]), cold stress
(FBti0019985, [55]), and heavy metal stress (FBti0019170, [56]),
while the FBti0019386 insertion was associated with faster
developmental time [54]. Some of these adaptive insertions have been
shown to affect gene expression through different molecular
mechanisms, such as affecting the polyadenylation site choice
[52] and adding TFBS [53], while others have been associated
with gene duplication [60, 62].


6 Rate of Loss


A recent study estimated genome-wide and superfamily-specific TE
deletion rates in D. melanogaster inbred lines [75]. The authors
found that most of the deletions involved retrotransposon elements,
suggesting that the deletions were due to ectopic recombination
instead of excision. Deletion rates were smaller than insertion rates
estimated in the same inbred lines [75].

In vertebrates, lineage-specific differences in TE deletion rates
have been reported [133]. A possible explanation for this
observation is that the success of some families results in a competition for
the genome resources leading to the elimination of other TE
families [133].

In addition to TE deletion rates, DNA loss rates should also be
considered. In the human lineage, estimates of DNA loss are smaller
than estimates of DNA gain, 650 Mb vs. 815 Mb [134], while in
D. melanogaster, the rate of DNA loss is higher than the rate of
DNA gain [135-137].
7 Horizontal Transfer of TE Insertions


8 Conclusion 


9 Questions 


In addition to parent to offspring transmission, TEs can also be 
horizontally transferred [138-141]. By combining simulation and 
analytical approaches, Groth and Blumenstiel [142] suggested that 
exposure rate to new TE families through horizontal transfer can be 
an important determinant of TE genomic content when the effects 
of drift in a population are weak [142]. Thus, larger populations are 
expected to carry a higher TE content if population exposure rate is 
proportional to population size [142]. So far, most of the evidence 
for TE horizontal transfer comes from closely related and geo- 
graphically close species [140]. There are several examples of hori- 
zontal transfer of TEs in Drosophila species, while so far horizontal 
transfer of TEs has not been described in humans [138 ]. 


Recent years have seen an increase in the number of reference 
genome sequences available as well as of population genome data- 
sets. The availability of all these genome sequences and the devel- 
opment of new bioinformatics tools have allowed us to update our 
previous estimates of genomic TE content that have increased both 
in humans and in D. melanogaster. These data has also allowed us to 
gather more evidence for the functional impact, both detrimental 
and beneficial, of TE insertions. Thus, it is still indisputable that 
understanding TE population dynamics is essential to understand 
genome structure, genome function, and genome evolution. 

New methods developed to analyze the dynamics of TEs in 
populations have shed light on the interplay between autonomous 
and nonautonomous TE copies, TE invasion dynamics, and how 
the mating system influences the dynamics of TEs in genomes. We 
have also considerably advanced our knowledge on the host factors 
that regulate TE activity as well as in the genome features that 
influence TE dynamics (Fig. 3). Finally, differences in effective 
population sizes that affect the efficiency of selection against new 
TE insertions and differences in the rates of TE loss between 
humans and D. melanogaster can still be considered two important 
factors that contribute to the different abundance, diversity, and 
activity of TEs in this two species [103]. 


How differences in the rate of DNA loss can affect the evolutionary 
dynamics of TEs? 


Why host regulation of transposition is relevant for TE dynamics? 
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Which is the most important factor explaining the differences in TE 


content, 
Drosophila: 


diversity, 


and 


activity between humans and 


Have the next-generation sequencing (NGS) technologies allowed 
us to identify all the TEs in a given genome? 


How does the interaction between active and inactive copies of TEs 


affect TE dynamics? 
Abstract 


In this chapter, we give a short introduction to the genetics of complex diseases emphasizing evolutionary 
models for disease genes and the effect of different models on the genetic architecture, and we give a survey 
of the state-of-the-art of genome-wide association studies (GWASs). 


Key words Complex diseases, Association mapping, Genome-wide association studies, Common 
disease common variant 


1 Introduction 


A combination of genes and environment determines our pheno- 
type. The degree to which genotype or environment influences our 
phenotype—the balance of nature versus nurture—varies from trait 
to trait, with some traits independent of genotype and determined 
by the environment alone and others determined by the genotype 
alone and independent of the environment. 

A measure quantifying the importance of genotype compared 
to the environment is the so-called heritability. It is the fraction of 
the total phenotypic variation in the population explained by varia- 
tion in the genotype within the population [1]. A trait of interest, 
say a common disease, which exhibits a nontrivial heritability, tells 
us that genes are important for understanding this trait and that it is 
worthwhile to identify the specific genetic polymorphisms influen- 
cing the trait. The first step toward this is association mapping: 
searching for genetic polymorphisms that, statistically, associate 
with the trait. Polymorphisms associated with a given phenotype 
need not influence that phenotype directly, but it is among those 
associated genetic polymorphisms that we will find the causal ones. 
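
In symbols, the definition of heritability given above can be sketched as follows. The decomposition of the phenotypic variance into a genetic and an environmental part assumes, as a simplification not stated in the text, that the two contributions are additive and uncorrelated; the symbols G, E, and P (genotypic value, environmental deviation, phenotype) are introduced here for illustration only.

```latex
H^2 = \frac{\operatorname{Var}(G)}{\operatorname{Var}(P)},
\qquad
\operatorname{Var}(P) = \operatorname{Var}(G) + \operatorname{Var}(E)
```

Under this sketch, a trait determined by the environment alone has H^2 = 0, and a trait determined by the genotype alone has H^2 = 1.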
Genetic variants are correlated, a phenomenon called linkage 
disequilibrium (LD), so by examining the trait association of a few 
variants, we learn about the association of many others. Examining 
the association between a phenotypic trait and a few hundred 
thousand to a million genetic variants suffices to capture how 
most of the common variation in the entire genome associates 
with the trait [2-4]. When we find a genetic variant associated 
with the trait, we have not necessarily located a variant that has 
any functional effect on the trait, but we have located a genomic 
region containing genetic variation that does. LD is predominantly 
a local phenomenon, so correlated genetic variants tend to be 
physically near each other on the genome. If we observe an associa- 
tion between the phenotype and a variant, and the variant is not 
causally affecting the trait but is merely in LD with a causal variant, 
the causal variant is likely nearby. Further examination of the region 
might reveal which variants affect the trait, and how, but that often 
involves functional characterization and is beyond association 
mapping. With association mapping, we merely seek to identify 
genetic variation that associates with a trait. 
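
The correlation between two variants is commonly summarized by the squared correlation coefficient r^2. The following is a minimal sketch of how it could be computed from the four haplotype counts at two biallelic loci; the function name and argument names are illustrative and not taken from any particular software package.

```python
def ld_r2(n00, n01, n10, n11):
    """Squared correlation (r^2) between two biallelic loci.

    nXY is the number of haplotypes carrying allele X (0 or 1) at the
    first locus and allele Y (0 or 1) at the second locus.
    """
    n = n00 + n01 + n10 + n11
    p1 = (n10 + n11) / n   # frequency of allele 1 at the first locus
    q1 = (n01 + n11) / n   # frequency of allele 1 at the second locus
    p11 = n11 / n          # frequency of the 1-1 haplotype
    d = p11 - p1 * q1      # LD coefficient D
    denom = p1 * (1 - p1) * q1 * (1 - q1)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


# Two loci whose alleles always occur together are perfectly correlated,
# so typing one of them tells us everything about the other.
perfect = ld_r2(60, 0, 0, 40)
# Haplotype counts that match the product of the allele frequencies
# correspond to no LD.
independent = ld_r2(36, 24, 24, 16)
```

With r^2 close to 1, testing one variant is nearly equivalent to testing the other, which is what makes it possible to survey common variation with a limited set of markers.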


2 The Allelic Architecture of Genetic Determinants for Disease 


Many complex diseases show a high heritability, typically ranging 
between 20% and 80%. Each genetic variant that increases the risk 
of disease contributes to the measured heritability of the disease and 
thus explains some fraction of the estimated total heritability of the 
trait. For most diseases investigated, many variants contribute, and 
the fraction of the heritability explained for each is therefore low. 
The number of contributing variants, their individual effects on the 
disease probability, their selection coefficients, and their dominance 
relations can be collectively termed the genetic architecture of a 
common disease. Insights into this architecture are slowly emerging 
and reveal differences between diseases [5]. 

Below we first consider two proposed genetic architectures 
based on theoretical arguments: the common disease common 
variant (CDCV) architecture and the common disease rare variant 
(CDRV) architecture. CDCV states that most of the heritability can 
be explained by a few high-frequency variants with moderate 
effects, while CDRV states that most of the heritability can be 
explained by moderate- or low-frequency variants with large effects. 
We present population genetic arguments for the two architectures 
and the consequences of the two architectures for association 
mapping. Later, in Subheading 5.1, we present empirical knowl- 
edge we have obtained about the genetic architectures of common 
diseases. 


2.1 Theoretical Models for the Allelic Architecture of Common Diseases 
Understanding the distribution of the number and frequency of 
genetic variants in a population is the purview of population genet- 
ics. Using diffusion approximations we can derive the expected 
frequency distribution of independent mutations under mutation- 
drift-selection balance in a stable population (see, e.g., Wright [6]). 
Central parameters are the mutation rate, u, and the selection for or 
against an allele, measured by s, scaled with the effective population 
size, N. Mutations enter a population with a rate determined by 
Nu, and subsequently, their frequencies change in a stochastic 
manner. If a mutant allele is not subject to natural selection, for 
example, if it does not lead to any change in function, it is selectively 
neutral. Its frequency then rises and falls with equal probability. If 
the allele is under selection, it has a higher likelihood of increasing 
in frequency than decreasing if it is under positive selection (s > 0) 
and conversely for negative selection (s < 0). 

At very high or very low frequencies, selection has an insignificant 
effect on the change in frequency, and the system evolves almost 
entirely stochastically (genetic drift). At moderate frequencies, 
however, the effect of selection is more pronounced, and given 
sufficiently strong selection (of the order Ns > 1), the direction of 
change in the allele frequency is almost entirely determined by the 
direction of selection. An allele under sufficiently strong selection 
that happens to reach moderate frequencies either stops increasing 
and drifts back to a low frequency, if selection works against it, or 
rapidly rises to high frequencies, if selection favors it; at high 
frequencies, the stochastic effects eventually dominate again (see 
Fig. 1). 

The ranges of frequencies where drift dominates and where selection 
dominates are determined by the strength of selection (Ns) and the 
genotypic characteristics of selection, such as dominance relations 
between alleles. For strong selection or in large populations, the 
process is predominantly deterministic for most frequencies, while 
for weak selection or a small population, the process is highly 
stochastic for most frequencies. The time an allele can spend at 
moderate frequencies is also determined by Ns and selection 
characteristics. 
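
The interplay between drift and selection described above can be explored with a small simulation. The sketch below uses a haploid Wright-Fisher model with N gene copies and a mutant of relative fitness 1 + s; this particular model, the function names, and the parameter values are assumptions chosen for illustration, not the diffusion calculations referred to in the text.

```python
import random


def wright_fisher(n_copies, s, start_count, generations, rng):
    """Follow the frequency of a mutant allele among n_copies gene copies.

    Each generation, selection shifts the expected frequency and binomial
    sampling of n_copies gene copies adds genetic drift.
    """
    count = start_count
    trajectory = [count / n_copies]
    for _ in range(generations):
        p = count / n_copies
        # Deterministic part: the mutant has relative fitness 1 + s.
        p_sel = p * (1 + s) / (p * (1 + s) + (1 - p))
        # Stochastic part: binomial sampling of the next generation.
        count = sum(1 for _ in range(n_copies) if rng.random() < p_sel)
        trajectory.append(count / n_copies)
        if count == 0 or count == n_copies:
            break  # the allele is lost or fixed
    return trajectory


def fixation_fraction(n_copies, s, replicates, seed=1):
    """Fraction of new mutations (one initial copy) that reach fixation."""
    rng = random.Random(seed)
    fixed = 0
    for _ in range(replicates):
        trajectory = wright_fisher(n_copies, s, 1, 10 * n_copies, rng)
        fixed += trajectory[-1] == 1.0
    return fixed / replicates
```

Comparing, for example, `fixation_fraction(100, 0.0, 2000)` with `fixation_fraction(100, 0.05, 2000)` gives a feel for how the product Ns, rather than s alone, governs whether selection or drift dominates.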

Pritchard and Cox [7, 8] used diffusion arguments to show that 
common diseases are expected to be caused by a large number of 
distinct mutations. This implies that genes commonly involved in 
susceptibility exert their effect through multiple independent 
mutations rather than a single mutation identical by descent in all 
carriers (see Fig. 2). Each mutation, if under weak purifying selec- 
tion, is unlikely to reach moderate frequencies, and since the popu- 
lation will only have few carriers of each disease allele, each can only 
explain little of the heritability. The accumulated frequency of 
several alleles, each kept to low frequency by selection, can, however, 
reach moderate frequencies. So the heritability can be 
explained either by many recurrent mutations or by many independent 
loci affecting the disease: the CDRV architecture. 

Fig. 1 Mutation, drift, and selection. New mutations enter a population at stochastic intervals, determined by 
the mutation rate, u, and the effective population size, N. For low or high frequencies, where the range of such 
frequencies is determined by the selection factor, s, and the effective population size, the frequency of a 
mutant allele changes stochastically. At medium frequencies, on the other hand, the frequency of the allele 
changes up or down, depending on s, in a practically deterministic fashion. If a positively selected allele 
reaches moderate frequency, it will quickly be brought to high frequency, at a speed also determined by s and 

Fig. 2 Accumulation of several rare frequencies. If selection works against a set of alleles, each will be kept at 
a low frequency. Their accumulated frequency, however, can be high in the population 

Implicitly, this model assumes a population in mutation- 
selection equilibrium, and this does not necessarily match human 
populations. Humans have recently expanded considerably in num- 
bers, and changes in our lifestyle, e.g., from hunter-gatherers to 
farmers, might have changed the adaptive landscape driving selec- 
tion on our genes. 
The frequency range where drift, rather than deterministic 
selection, dominates is larger in a small population than in a large 
population. We can think of the drift process as a birth-death 
process operating on individual copies of genes, which is 
highly stochastic. Only when we consider a large number of these 
processes do we get an almost deterministic process. At low allele 
frequencies, the process is stochastic because we only have a few 
copies of the allele to consider. At higher frequencies, we have many 
copies, so we get the deterministic behavior. The same number of 
copies, however, constitutes a higher frequency in a small population 
than in a larger population. Consequently, selection is effective 
at much lower frequencies in a large population than it is in a small 
population; the absolute number of copies of a deleterious allele 
might be the same in a small and a large population, but they 
constitute a smaller fraction of the large population. In large 
populations, we expect deleterious mutations to be found at low 
frequencies unless, as is the case for most human populations, the 
large population size is a consequence of recent dramatic growth 
[9]. This effect is illustrated as the “transition period” in Fig. 3, 
where common genetic variants may contribute much more to 
disease than under stable demographic conditions. Following 
expansion, alleles that would otherwise be held at low frequency 
by selection may be at moderate frequencies and thus contribute a 
larger part of the heritability: the CDCV architecture. 

Similarly, a recent change in the adaptive landscape of a popu- 
lation might cause an allele that was previously held at low fre- 
quency to be under positive selection and now rise in frequency 
[10]. In this transition period, an allele may be at a moderate 
frequency and therefore contribute significantly to the heritability 
of disease susceptibility (see Fig. 4). 

Depending on which architecture underlies a given disease, 
different strategies are needed to discover the genetic variants 
involved. When genome-wide association mapping was proposed as 
a strategy for discovering disease variants, the proposal was based 
on the hypothesis that, at least for some common diseases, the 
CDCV architecture underlies them. GWAS relies on the CDCV 
hypothesis for two practical reasons. The first is that the LD 
patterns across the genome greatly restrict examination to only a 
small fraction of the total possible variation. It is feasible to probe 
the common variants of a genome from a small selection of 
representative variants, but the association with rare variants is far 
less detectable. Second, statistical analysis of the association between 
polymorphism and disease is rather straightforward for moderate- 
frequency alleles but has far less power to detect association with 
low-frequency alleles. 

Because the GWAS approach is only practical for discovering 
common variants, it was necessary to hypothesize that the CDCV 
architecture underlies the diseases of interest. The actual genetic 
architecture behind common diseases was unknown, but there were 
no alternative methods aimed at CDRV, so GWAS was the only 
show in town. 

Fig. 3 A population out of equilibrium following an expansion. In a transition period following a population 
expansion, the allele frequency patterns are different from the patterns in a stable population (figure label: 
Transition period where allele frequency is higher than what would be expected from selection) 

Fig. 4 A population out of equilibrium following changes in the selective landscape. If the selection of an allele 
changes direction, so the positively selected allele becomes negatively selected and vice versa, it will 
eventually move through moderate frequencies. Following a change in the selective landscape, it is thus 
possible to find alleles at moderate frequencies that would not otherwise be found 


2.2 The Allelic Frequency Spectrum in Humans 


The vast majority of human nucleotide variation is very rare because 
of our history of population bottlenecks followed by rapid growth. 
For instance, in the 2500 individuals of the 1000 genomes study, 
64 million SNVs have frequency <0.5%, and 20 million SNVs have 
frequency >0.5% [11]. Nevertheless the majority of heterozygous 
variants observed within a single individual are not rare [11]. The 
rare variants are most often very recent and therefore specific to 
populations, and they are also more often deleterious because 
selection has not yet acted on them [12]. This is particularly clear 
for loss-of-function variants and other protein-coding variants. A 
study of 2636 Icelanders found that the fraction of variants with a 
minor allele frequency (MAF) below 0.1% was 62% for protein- 
truncating variants, 46% for missense variants, and 38% for 
synonymous variants [13]. 

The strong recent population expansions have also allowed 
variants to increase in frequency by surfing on the population 
expansion wave front even if they would be selected against in a 
population with stable size. Thus, rare variants with large effects on 
disease may exist. GWASs so far have been successful in 
identifying a large set of common variants associated with disease, 
so common variants contributing to disease do exist. It is likely that 
rare variants with large phenotypic effects also contribute to the 
heritability of many common diseases, but the extent is likely to be 
disease specific. 


3 The Basic GWAS 


The first GWASs were published around 2006 [14, 15] when 
Illumina and Affymetrix first introduced genotyping chips that 
made it possible to test hundreds of thousands of SNPs quickly 
and inexpensively. The GWAS approach to finding susceptibility 
variants for diseases boils down to testing approximately 0.3-2 
million SNPs (depending on chip type) for differences in allele 
frequencies between cases and controls, adjusting for the high 
number of multiple tests. This approach is a wonderfully simple 
procedure that requires no complicated statistics or algorithms but 
only well-known statistical tests and a minimum of computing 
power. Despite the simplicity, some issues remain, such as faulty 
genotype data and confounding factors that can result in erroneous 
findings if not handled properly. The most important aspects of any 
GWAS are, therefore, thorough quality control of the data used and 
measures to avoid and reduce the effect of confounding factors. 


3.1 Statistical Tests 


The primary analysis in an association study is usually testing each 
variant separately under the assumption of an additive or multipli- 
cative model. One way of doing that is by creating a 2 × 2 allelic 
contingency table, as shown in Table 1, by summing the number of 
A and B alleles seen in all case individuals and all control individuals. 
Be aware that we are counting alleles and not individuals in this 
contingency table, so N_cases will be equal to two times the number 
of case individuals because each individual carries two copies of each 
variant unless we are looking at non-autosomal DNA. 

Søren Besenbacher et al. 

Table 1 
Contingency table for allele counts in case/control data 

           Allele A        Allele B        Total 
Case       N_case,A        N_case,B        N_cases 
Control    N_control,A     N_control,B     N_controls 
Total      N_A             N_B             N 

If there is no association between the variant and the disease in 
question, we would expect the fraction of cases that have a particular 
allele to match the fraction of controls that have that allele. In that 
case, the expected allele count (EN) would be as shown in Table 2. 

Table 2 
Expected allele counts in case/control data 

           Allele A                 Allele B                 Total 
Case       (N_cases × N_A)/N        (N_cases × N_B)/N        N_cases 
Control    (N_controls × N_A)/N     (N_controls × N_B)/N     N_controls 
Total      N_A                      N_B                      N 

To test whether the difference between the observed allele counts 
(in Table 1) and the expected allele counts (in Table 2) is signifi- 
cant, a Pearson χ² statistic can be calculated: 

\[ \chi^2 = \sum_{\text{Phenotype}} \sum_{\text{Allele}} \frac{\left(N_{\text{Phenotype},\text{Allele}} - EN_{\text{Phenotype},\text{Allele}}\right)^2}{EN_{\text{Phenotype},\text{Allele}}} \] 


This statistic approximately follows a χ² distribution with 1 degree of 
freedom, but if the expected allele counts are very low (<10), the 
approximation breaks down. This means that if the MAF is very low 
or if the total sample size, N, is small, an exact test, such as 
Fisher's exact test, should be applied. An alternative to the tests that 
use the 2 × 2 allelic contingency table, and thereby assume a 
multiplicative model, is the Cochran-Armitage trend test, which 
assumes an additive risk model [16]. This test is preferred by 
some since it does not require an assumption of Hardy-Weinberg 
equilibrium in cases and controls combined [17]. 
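
The allelic test described above can be sketched in a few lines of Python. The function name and the input format (genotype counts per group) are choices made for this illustration; the p-value uses the fact that a χ² variable with 1 degree of freedom is a squared standard normal variable. The sketch assumes the variant is polymorphic, so that no expected count is zero.

```python
import math


def allelic_chi2(case_genotypes, control_genotypes):
    """Pearson chi-square test on the 2 x 2 allelic contingency table.

    Each argument is a tuple (n_AA, n_AB, n_BB) of individual counts.
    Returns (chi2, p_value) for 1 degree of freedom.
    """
    def allele_counts(genotypes):
        n_aa, n_ab, n_bb = genotypes
        # Every individual contributes two alleles.
        return 2 * n_aa + n_ab, 2 * n_bb + n_ab

    observed = [allele_counts(case_genotypes), allele_counts(control_genotypes)]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical genotype counts (AA, AB, BB) for cases and controls.
chi2, p_value = allelic_chi2((120, 460, 420), (90, 420, 490))
```

For variants with very low expected counts, the same table would instead be passed to an exact test, as noted in the text.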

While a 1 degree of freedom test that assumes an additive or 
multiplicative model is usually the first analysis, some studies also 
perform a test that would be better at picking up associations 
following a dominant or recessive pattern, for instance, by 
performing a 2 degrees of freedom test of the null hypothesis of 
no association between rows and columns in the 2 × 3 contingency 
table that counts genotypes instead of alleles. 
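
A minimal sketch of the 2 degrees of freedom genotype test follows. As before, the function name and input format are illustrative assumptions; for 2 degrees of freedom the χ² survival function has the closed form exp(-x/2), which keeps the example free of external libraries. All three genotype classes are assumed to be observed at least once.

```python
import math


def genotypic_chi2(case_genotypes, control_genotypes):
    """Pearson chi-square test on the 2 x 3 genotype table.

    Each argument is a tuple (n_AA, n_AB, n_BB) of individual counts.
    Returns (chi2, p_value) for 2 degrees of freedom.
    """
    observed = [list(case_genotypes), list(control_genotypes)]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(3):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 2 df the chi-square survival function is exactly exp(-x / 2).
    return chi2, math.exp(-chi2 / 2)
```

Because this test spends an extra degree of freedom, it is less powerful than the 1 degree of freedom test when the true model is additive, which is one reason it is usually a secondary analysis.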


3.2 Effect Estimates 


3.3 Quality Control 
A commonly used way of measuring the effect size of an association 
is the allelic odds ratio (OR), which is the ratio of the odds of being 
a case given that you carry n copies of allele A to the odds of being 
a case if you carry n − 1 copies of allele A. Assuming a multiplicative 
model, this can be calculated as: 


\[ \mathrm{OR} = \frac{N_{\text{case},A} / N_{\text{control},A}}{N_{\text{case},B} / N_{\text{control},B}} = \frac{N_{\text{case},A}\, N_{\text{control},B}}{N_{\text{case},B}\, N_{\text{control},A}} \] 


Another measure of effect size that is perhaps more intuitive is 
the relative risk (RR), which is the disease risk in carriers divided by 
the disease risk in noncarriers. This measure, however, suffers from 
the weakness that it is harder to estimate. If our cases and controls 
were sampled from the population in an unbiased way, the allelic 
RR could be calculated as: 


\[ \mathrm{RR} = \frac{N_{\text{case},A} / N_{A}}{N_{\text{case},B} / N_{B}} \] 


but it is very rare to have an unbiased population sample in associa- 
tion studies because the studies are generally designed to deliber- 
ately oversample the cases to increase the power. This oversampling 
affects the RR as calculated by the formula above but not the OR 
which is one of the reasons why the OR is usually reported in 
association studies instead of the RR. 
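
The effect of oversampling cases on the two measures can be illustrated with a short sketch. The allele counts below are hypothetical numbers chosen for illustration, and the function names are not taken from any existing package.

```python
def odds_ratio(n_case_a, n_case_b, n_control_a, n_control_b):
    """Allelic odds ratio from the 2 x 2 allele-count table."""
    return (n_case_a * n_control_b) / (n_case_b * n_control_a)


def relative_risk(n_case_a, n_case_b, n_control_a, n_control_b):
    """Allelic relative risk; only meaningful for an unbiased sample."""
    risk_a = n_case_a / (n_case_a + n_control_a)
    risk_b = n_case_b / (n_case_b + n_control_b)
    return risk_a / risk_b


# Hypothetical allele counts from an unbiased population sample.
population = dict(n_case_a=30, n_case_b=70, n_control_a=170, n_control_b=730)
# The same population with cases oversampled tenfold.
oversampled = dict(n_case_a=300, n_case_b=700, n_control_a=170, n_control_b=730)

or_population = odds_ratio(**population)
or_oversampled = odds_ratio(**oversampled)
rr_population = relative_risk(**population)
rr_oversampled = relative_risk(**oversampled)
```

Multiplying both case counts by the same factor cancels in the OR but not in the RR, so `or_population` and `or_oversampled` agree while the two RR values differ.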


Data quality problems can be either variant specific or individual 
specific, and inspection usually results in the removal of both prob- 
lematic individuals and problematic variants from the data set. 

Individual-specific problems can be caused by low DNA quality 
or contamination by foreign DNA. A sample of low DNA quality 
results in a high rate of missing data, where particular variants 
cannot be called, and there is a higher risk of miscalling variants. 
It is, therefore, recommended that individuals lacking calls in more 
than 2-3% of the variants be removed from the analysis. Excess 
heterozygosity is an indicator of sample contamination, and 
individuals displaying it should also be excluded. Sex checks and 
other kinds of phenotype tests might also be applied to remove 
individuals, where the genotype information does not match the 
phenotype information due to a sample mix-up [18]. 
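
The individual-level filters described above could be sketched as follows. The data layout (a dictionary of genotype calls coded 0, 1, 2, or None for a missing call), the function name, and the heterozygosity cutoff are assumptions made for this illustration; in practice the heterozygosity threshold is usually chosen by inspecting its distribution across samples.

```python
def sample_qc(genotypes, max_missing=0.03, max_het=None):
    """Flag individuals for removal based on call rate and heterozygosity.

    genotypes: dict mapping sample id -> list of calls, coded as the
    number of copies of one allele (0, 1, 2) or None for a missing call.
    Returns a dict mapping each flagged sample id to the reason.
    """
    removed = {}
    for sample, calls in genotypes.items():
        called = [g for g in calls if g is not None]
        missing_rate = 1 - len(called) / len(calls)
        het_rate = (
            sum(1 for g in called if g == 1) / len(called) if called else 0.0
        )
        if missing_rate > max_missing:
            removed[sample] = "missingness"
        elif max_het is not None and het_rate > max_het:
            removed[sample] = "excess heterozygosity"
    return removed


# Hypothetical toy data: s2 has 10% missing calls, s3 is heterozygous
# at every variant.
toy = {
    "s1": [0, 1, 2, 0] * 25,
    "s2": [None] * 10 + [0] * 90,
    "s3": [1] * 100,
}
flagged = sample_qc(toy, max_missing=0.03, max_het=0.6)
```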

For a given variant, the data from an individual can be suspi- 
cious in two ways: it can fail to be called by the genotype-calling 
program or it can be miscalled. Typically, a conservative cutoff value 
is used in the calling process, ensuring that most problems show up 
as missing data rather than miscalls. Most problematic variants, 
therefore, reveal a high fraction of missing data, and variants with a 
fraction of missing calls above a given threshold (typically, 1-5%) 
are removed. Miscalls typically occur when the homozygotes are hard 
to distinguish from the heterozygotes, and some of the heterozygotes 
are misclassified as homozygotes or vice versa. Both biases 
manifest as deviation from Hardy-Weinberg equilibrium, and SNPs 
that show large deviations from Hardy-Weinberg equilibrium 
within the controls should be removed [19]. 


3.4 Confounding Factors 


Confounding in GWAS can arise if there are genotyping batch 
effects or if there is population or family structure in the sample. 
For example, if cases and controls in GWAS are predominantly 
collected from geographically distinct areas, association signals 
could arise due to genetic differences caused by geographic varia- 
tion, and most of such genetic signals are unlikely to be causal. Such 
confounding due to population structure typically occurs when 
samples have different genetic ancestry, e.g., if the sample contains 
individuals of both European and Asian ancestry. Population struc- 
ture confounding can also happen when the population structure is 
more subtle, especially for large sample sizes. Methods for inferring 
population substructure, such as principal components analysis, are 
useful for detecting outliers we can remove from the data 
[20]. However, this approach is not suitable when dealing with 
subtle structure, as a small bias can become significant in a large 
enough sample of individuals of similar genetic ancestry. 
Confounding in GWAS can be detected as inflation of the test 
statistics, beyond what is expected due to truly causal variants. A 
useful way of visualizing such inflation of test statistics is the 
so-called quantile-quantile (QQ) plot. In this plot, ranked values 
of the test statistic are plotted against their expected distribution 
under the null hypothesis. In the case of no true positives and no 
inflation of the test statistic due to population structure or cryptic 
relatedness, the points of the plot lie on the x = y line (see Fig. 5a). 
True positives show an increase in values above the line in the right 
tail of the distribution but do not affect the rest of the points since 
only a small fraction of the SNPs is expected to be true positives 
(Fig. 5b). Cryptic relatedness and population stratification lead to a 
deviation from the null distribution across the whole distribution 
and can, thus, be seen in the QQ plot as a line with a slope larger 
than 1 (Fig. 5c). 

Fig. 5 QQ plots from a χ² distribution. (a) A QQ plot, where the observation follows the expected distribution. 
(b) A QQ plot, where the majority of observations follow the expected distribution, but where some have 
unexpectedly high values, i.e., are statistically significant. (c) A QQ plot, where the observations all seem to be 
higher than expected, which is an indication that the observations are not following the expected distribution 
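
The coordinates of such a QQ plot can be computed without any plotting library. The sketch below pairs the sorted test statistics with the quantiles expected under a χ² distribution with 1 degree of freedom; the plotting positions (i + 0.5)/m are one common convention and, like the function names, an assumption of this example.

```python
from statistics import NormalDist


def chi2_1df_quantile(p):
    """Quantile of the chi-square distribution with 1 df.

    A chi-square variable with 1 df is a squared standard normal, so
    P(Z^2 <= q) = p is equivalent to Phi(sqrt(q)) = (1 + p) / 2.
    """
    z = NormalDist().inv_cdf((1 + p) / 2)
    return z * z


def qq_points(chi2_stats):
    """Return (expected, observed) pairs for a QQ plot under the null."""
    observed = sorted(chi2_stats)
    m = len(observed)
    expected = [chi2_1df_quantile((i + 0.5) / m) for i in range(m)]
    return list(zip(expected, observed))
```

Plotting the pairs returned by `qq_points` against the x = y line gives the three patterns described in Fig. 5: points on the line, points departing only in the right tail, or points above the line throughout.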

Several approaches accounting for population structure in 
GWAS have been proposed. Devlin and Roeder [21, 22] proposed 
genomic control, i.e., shrinking the observed χ² test statistics so that 
their median coincides with the expected value under the null model. 
However, studies by Yang et al. [23] and Bulik-Sullivan et al. [24] 
pointed out that the median and mean χ² statistics are expected to be 
inflated for polygenic traits, even when there is no population 
structure confounding. With that in mind, we recommend adjust- 
ing for the confounders in the statistical model instead of 
performing genomic control. One such approach is to include 
covariates that capture the relevant structure in the model. Price 
et al. [25] proposed including the largest principal components as 
covariates in the model to adjust for population structure. This 
approach has proved to be effective in most cases. However, if the 
sample includes related individuals or if it is very large, controlling 
for the top PCs may not be able to capture subtle structure. An 
alternative approach is to use mixed models [26, 27], where the 
expected genetic relatedness between the individuals is included in 
the model. Advances in computational efficiency of mixed models 
[28] now enable analysis of very large and complex data sets, such 
as the UK Biobank data set [29]. 
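
Even when confounders are handled in the statistical model rather than by genomic control, the inflation factor that genomic control is built on remains a convenient diagnostic. The sketch below computes it as the ratio of the observed median χ² statistic to the null median and shows the rescaling step; the function names are illustrative, and leaving deflated statistics unchanged (capping the factor at 1) is a convention assumed here rather than something stated in the text.

```python
from statistics import NormalDist, median


def genomic_inflation(chi2_stats):
    """Observed median chi-square (1 df) divided by its null median."""
    null_median = NormalDist().inv_cdf(0.75) ** 2  # about 0.455
    return median(chi2_stats) / null_median


def genomic_control(chi2_stats):
    """Rescale the statistics so their median matches the null median.

    Statistics are only ever shrunk: an inflation factor below 1 is
    treated as 1 (an assumption of this sketch).
    """
    lam = max(genomic_inflation(chi2_stats), 1.0)
    return [x / lam for x in chi2_stats]
```

As the text points out, an inflation factor above 1 does not by itself prove confounding, because polygenic traits inflate the median statistic as well.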

Besides population structure, family structure or cryptic relat- 
edness can also confound the analyses. Here one can identify closely 
related individuals by calculating a genetic relatedness matrix and 
prune the data so that it does not contain any close relatives. Lastly, 
sequencing batch effects due to incomplete randomizations can 
lead to structure, unrelated to genetics, which confounds the anal- 
ysis. A study on polygenic prediction of longevity by Sebastiani 
et al. [30] serves as a warning. The researchers applied two different 
kinds of chips and failed to remove several SNPs that exhibited bad 
quality on only one of the chips [31]. If the proportions of the two 
kinds of chips had been the same in cases and controls, this would 
probably not have resulted in false signals. Unfortunately, the chip 
with the bad SNPs was used in twice as many cases as controls. 
When this genotyping batch effect was discovered, the 
authors had to retract their publication from Science. The type and 
frequency of errors that may happen during sample preparation and 
SNP calling are likely to vary through time and space, so case and 
control samples should be completely randomized as early as possi- 
ble in the genotyping procedure. Failure to carefully plan 
this aspect of an investigation introduces errors in the data that are 
hard, if not impossible, to detect, and they may reduce interesting 
findings to mere artifacts. 
3.5 Meta-analysis of GWAS 


3.6 Replication 


The statistical power to detect association depends directly on the 
sample size used, all other things being equal. This fact has driven 
researchers to collaborate across institutions and countries in 
GWAS consortia, where they combine multiple cohorts in one 
large analysis. However, for logistical and legal reasons, it may not 
be possible to share individual-level genotypes, which are required 
for all of the GWAS approaches covered so far. Meta-analyses of 
GWASs performed in each cohort are a solution to this problem. 
These require coordination between the researchers, who 
share GWAS summary statistics instead of individual-level 
genotypes. These summary statistics are then meta-analyzed using 
statistical approaches that either assume a constant effect across 
cohorts (fixed-effect models) or allow the effect to vary between 
cohorts (random-effects models). In recent years, many large-scale 
GWAS meta-analyses have been published, and the resulting summary 
statistics are often made public, providing a treasure trove for 
understanding the genetics of common diseases and traits [32]. 
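
For a single variant, a meta-analysis that assumes a constant effect across cohorts can be sketched as an inverse-variance weighted average of the per-cohort estimates. This is one standard way of combining summary statistics and is shown here as an illustration; the text does not specify which method any particular consortium used, and the function name and example numbers are hypothetical.

```python
import math


def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance weighted fixed-effect meta-analysis of one variant.

    betas: per-cohort effect estimates (e.g., log odds ratios).
    standard_errors: the standard errors of those estimates.
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return pooled_beta, pooled_se, z, p_value


# Hypothetical log odds ratios and standard errors from three cohorts.
pooled = fixed_effect_meta([0.12, 0.08, 0.15], [0.05, 0.04, 0.07])
```

Only the effect estimate and its standard error are needed from each cohort, which is why summary statistics suffice and individual-level genotypes never have to leave the contributing institutions.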


The best way to make sure that a finding is real is to replicate it. If 
the same signal is found in an independent set of cases and controls, 
it means that the association is unlikely to be the result of a con- 
founding factor specific to the original data. Likewise, if the associ- 
ation persists after typing the markers using another genotyping 
method, it means that it is not a false positive due to some artifact of 
the genotyping method used. 

When trying to replicate a finding, the best strategy is to try to 
replicate it in a population of similar ancestry. A marker that corre- 
lates with a true causal variant in one population might not be 
correlated with the same variant in a population of different ethnic- 
ity, where the LD structure can be different. This is especially 
problematic when trying to replicate an association found in a 
non-African population in an African population [33]. A marker 
might easily have 20 completely correlated markers in a European 
population, but no good correlates in an African population. To 
replicate a finding in the European population of one of these 
variants, it does not suffice to test one of the variants in an African 
population; all 20 variants must be tested. This, however, also offers 
a way to fine map the signal and possibly find the causative 
variant [34]. 

Before spending time and effort to replicate an association 
signal in a foreign cohort, it is a good idea to search for the existing 
partial replication of the marker within the data. Usually, a marker is 
surrounded by several correlated markers on the genotyping chip, 
and if one marker shows a significant association, then the correlated
markers should show an association too. If a marker is significantly
associated with a disease, but no other marker in the region is, then
the association should be viewed as suspicious.

4 Imputation: Squeezing More Information Out of Your Data


4.1 Selection of 
Reference Data Set 


The current generation of SNP chip types includes only 0.3-2 million
of the nine to ten million common SNPs in the human genome
(i.e., SNPs with a MAF of more than 5%). Because of the correlation
between SNPs in LD, however, the SNP chips can still claim to
assay most of the common variants in the genome (in European
populations at least). Although the Illumina HumanHap300 chip
only directly tests about 3% of the ten million common SNPs, it still
covers 77% of the SNPs in HapMap with a squared correlation
coefficient ($r^2$) of at least 0.8 in a population of European ancestry
[35]. The corresponding fraction in a population of African ances- 
try is only 33%, however. 

These numbers expose two limitations of the basic GWAS 
strategy. First, there is a substantial fraction of the common SNPs 
that are not well covered by the SNP chips even in European 
populations (23% in the case of the HumanHap300 chip). Second, 
we rely on tagging to test a large fraction of the common SNPs, and 
this diluted signal from correlated SNPs inevitably causes us to 
overlook true associations in many instances. An efficient way of 
alleviating these limitations is genotype imputation, where geno- 
types that are not directly assayed are predicted using information 
from a reference data set that contains data from a large number of 
variants. Such imputation improves the GWAS in multiple ways: It 
boosts the power to detect associations, gives a more precise loca- 
tion of an association, and makes it possible to do meta-analyses 
between studies that used different SNP chips [36]. 


The two important choices when performing imputation are the 
reference data set to use and the software to use. Usually, a publicly 
available reference data set, such as the 1000 Genomes Project [11] 
or the large Haplotype Reference Consortium [37], is used. Alter- 
natively, researchers sequence a part of their study cohort and thus 
create their own reference data set. The latter strategy has the 
advantage that one can be certain that the ancestry of the reference 
data matches the ancestry of the study cohort. It is important that 
the reference data be from a population that is similar to the study 
population. If the reference population is too distantly related to 
the study population, the reliability of the imputed data will be 
reduced. The quality and nature of the reference data also limit the 
quality of the imputed data in other ways. A reference data set 
consisting of only a small number of individuals is not able to 
reliably estimate the frequency of rare variants and that in turn 
means that the imputation of rare variants lacks in accuracy. This 
means that there is a natural limit to how low a frequency a variant 
can have and still be reliably imputed.

4.2 Imputation
Software 


4.3 Testing Imputed 
Variants 


The largest publicly available reference data set is the Haplotype 
Reference Consortium (HRC) that combines whole-genome 
sequence data from 20 studies of predominantly European ancestry 
[37]. The first release of this reference panel has data from 32,611 
samples at 39,235,157 SNPs. The large sample size means that 
variants with minor allele frequencies as low as 0.1% can correctly 
be imputed using this data set. 

The use of imputation methods does not only offer the possi- 
bility of increased SNP coverage, but, given the right reference 
data, also eases the analysis of common non-SNP variation, such 
as indels and copy number variations (CNVs). So far, however, some
reference panels have included only SNVs and disregarded indels and
structural variants. The increasing quality of whole-genome
sequencing and software for calling structural variants
means that better data sets that include structural variants should 
soon become available. Imputation will then make it possible to use 
the SNP chips to test many indels and structural variants that are 
not being (routinely) tested today [38]. 


The commonly applied genotype imputation methods, such as 
IMPUTE2 [39], BIMBAM [40], MaCH-Admix [41], and mini- 
mac3 [42], are all based on hidden Markov models (HMMs). 
Comparisons of these software packages have shown that they 
produce data of broadly similar quality but that they are superior 
to imputation software based on other methodological approaches 
[36, 43]. The basic HMMs used in these programs are similar to 
earlier HMMs developed to model LD patterns and estimate 
recombination rates. 

When the sample size is large, imputation using these 
HMM-based methods imposes a high computational burden. 
One possible way of decreasing this burden is to pre-phase the 
samples so that resolved haplotypes are used as input for the impu- 
tation software instead of genotypes [44]. But even with 
pre-phasing, the computational task is far from trivial, and whole- 
genome imputation is not a task that can be performed on a single 
computer. This computational problem can be solved by using one 
of the two free imputation services that have recently been launched 
(https://imputationserver.sph.umich.edu, https://imputation.sanger.ac.uk).
These services allow users to upload their data
through a web interface and choose between a set of reference 
panels. The data set will then be imputed on a High Performance 
Computing Cluster, and the user will receive an email when the 
imputed data is ready for download. 


Since imputation is based on probabilistic models, the output is 
merely a probability for each genotype for the unknown variants in 
a given individual. That is, instead of reporting the genotype of an 
individual as AG, say, the program reports that the probability of
the genotype being AA is 5%, that of being AG is 93%, and that of
being GG is 2%. This nature of the output data challenges the 
GWAS. The simplest way of analyzing the imputed data is to use 
the “best guess” genotype, i.e., assume the genotype with the 
highest probability and ignore the others. In the example above, 
the individual would be given the genotype AG at the SNP in 
question, and usually, an individual’s genotype would be consid- 
ered as missing if none of the genotypes have a probability larger 
than a certain threshold (e.g., 90%). The use of “best guess” 
genotype is problematic since it does not take the uncertainty of 
the imputed genotypes into account, may introduce a systematic 
bias, and lead to false positives and false negatives. A better way is to
use a logistic regression on the expected allele count; in the
example above, the expected allele count for allele A would be 1.03
($2p_{AA} + p_{AG}$). This method has proved to be surprisingly robust at
least when the effect of the risk allele is small [45], which is the case 
for most of the variants found through GWAS. An even better 
solution is to use methods that fully account for the uncertainty 
of the imputed genotypes [45-47]. 
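The two simplest summaries of an imputed genotype described above can be sketched as follows. The function names and the default threshold are assumptions made for illustration; actual imputation programs define their own output formats.

```python
def expected_allele_count(p_aa, p_ag, p_gg):
    """Expected number of A alleles (the dosage) given genotype probabilities."""
    if abs(p_aa + p_ag + p_gg - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    # AA carries two copies of A, AG one copy, GG none.
    return 2.0 * p_aa + 1.0 * p_ag

def best_guess(p_aa, p_ag, p_gg, threshold=0.9):
    """Most probable genotype, or None (missing) if it does not exceed the threshold."""
    probabilities = {"AA": p_aa, "AG": p_ag, "GG": p_gg}
    genotype = max(probabilities, key=probabilities.get)
    return genotype if probabilities[genotype] > threshold else None

# The example from the text: P(AA) = 5%, P(AG) = 93%, P(GG) = 2%.
dosage = expected_allele_count(0.05, 0.93, 0.02)
call = best_guess(0.05, 0.93, 0.02)
```

The dosage keeps part of the uncertainty as a continuous covariate for the regression, whereas the best-guess call discards it.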


5 Current Status 


After the first GWAS saw publication in 2005, it was followed by 
many more studies, and today almost 4000 such studies of human 
diseases or traits have been published (Fig. 6a). The first GWASs 
had moderate sample sizes with hundreds of samples, but over the 
years the sample sizes and thereby the power of the studies have 
gradually been increasing (Fig. 6b). Imputation and later also next-generation
sequencing have resulted in a rapid increase in the

Fig. 6 GWAS statistics from the NHGRI-EBI GWAS Catalog [63] (accessed June 2017). (a) The cumulative
number of GWASs published since 2005. (b) The initial sample sizes of the GWASs. For dichotomous traits the 
combined number of cases and controls is shown. Replication samples are not counted. (c) The number of 
tested variants in each study

5.1 Polygenic
Architecture of 
Common Diseases 


5.2 Pleiotropy 


number of variants that are tested in a GWAS (Fig. 6c). All these 
GWASs published in the last decade have increased our knowledge 
about the genetic architecture of common diseases a lot. In this 
section, we will go through some of the insights that have been 
revealed by these studies. 


GWASs have consistently shown that most complex traits and dis- 
eases have very polygenic architectures with a large number of 
causal variants with small effects. The small effect sizes mean that 
enormous sample sizes are needed to detect the associated variants 
and that each variant only explains a small fraction of the heritabil- 
ity. Even though large sample sizes have led to the discovery of 
many loci affecting common diseases, the aggregated effect of all 
these loci still only explains a small fraction of the heritability. 

A good example is type 2 diabetes where researchers by 2012 
had identified 63 associated loci that collectively only explained 
5.7% of the liability-scale variance [48]. Such results led to much 
discussion about the possible source of the remaining “missing 
heritability” [49, 50]. A significant contribution to this debate 
was when researchers in 2010 started using mixed linear models 
to estimate the heritability explained by all common variants not 
only those that surpass a conservative significance threshold. These 
studies showed that a significant fraction of the so-called missing 
heritability was not truly missing from the GWAS data sets but only 
hidden due to small effect sizes. This was first illustrated in height 
where 180 statistically significant SNPs could only explain 10% of 
the heritability, but this fraction increased to 45% when all geno- 
typed variants were considered [51]. 

For common diseases, such analyses have typically shown that 
around half of the heritability can be explained by considering all 
common variants. Given the small individual contribution of each 
of the discovered variants and that the individual contribution of 
the yet to be found variants will be even smaller, it is likely that the 
actual number of causal variants will be much more than a thousand 
for many common diseases. Recent data shows that in many dis- 
eases these causal variants are relatively uniformly distributed along 
the genome. It has, for instance, been estimated that 71-100% of 
1 Mb windows in the genome contribute to the heritability of
schizophrenia [52]. Another article recently estimated that most
100 kb windows contribute to the variation of height and that
more than 100,000 markers have an independent effect on height. 
This strikingly large number leads the authors to propose a new 
“omnigenic” model in which most genes expressed in a cell type 
that is relevant for a given disease have a nonzero contribution to 
the heritability of that disease [53]. 


The variants that have been discovered by GWASs so far reveal 
numerous examples where one genetic locus affects multiple often 
seemingly unrelated traits [54, 55]. One explanation for such a
shared association between a pair of traits is mediation where the
shared locus affects the risk of one of the traits, and that trait is 
causal for the other. Another possible explanation is pleiotropy 
where the shared locus is independently causal for both traits. It is 
possible to distinguish between mediation and true pleiotropy by 
adjusting or stratifying for one trait while testing the other. In the 
case of mediation, it is also possible to determine the direction of 
the causation. In general, it is difficult to make such causal inference 
from observational data, but Mendelian randomization, which uses 
significantly associated variants as instrumental variables, can in 
some circumstances be used to assess a causal relationship between 
a potential risk factor and a disease. For instance, Voight and 
colleagues used SNPs associated with lipoprotein levels to assess 
whether the correlation between different forms of lipoprotein and 
myocardial infarction risk was causal [56]. They found that while 
low-density lipoprotein (LDL) had a causal effect on disease risk, 
high-density lipoprotein (HDL) did not. 
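As a rough illustration of how associated variants can act as instrumental variables, the sketch below computes per-variant Wald ratios and an inverse-variance weighted (IVW) combination from summary statistics. The IVW estimator is one standard choice and is not necessarily the method used in [56]. The input numbers are made up, and the sketch assumes independent variants that influence the outcome only through the exposure.

```python
import math

def wald_ratio(beta_exposure, beta_outcome):
    """Causal effect estimate from one variant: outcome effect per unit exposure effect."""
    return beta_outcome / beta_exposure

def ivw_estimate(betas_exposure, betas_outcome, ses_outcome):
    """Inverse-variance weighted combination of the per-variant Wald ratios.

    Each ratio is weighted by beta_exposure^2 / se_outcome^2, which gives
    sum(bx * by / se^2) / sum(bx^2 / se^2).
    """
    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(betas_exposure, betas_outcome, ses_outcome))
    denominator = sum(bx ** 2 / se ** 2
                      for bx, se in zip(betas_exposure, ses_outcome))
    estimate = numerator / denominator
    standard_error = math.sqrt(1.0 / denominator)
    return estimate, standard_error

# Hypothetical SNP effects on a risk factor (exposure) and on disease risk (outcome).
bx = [0.10, 0.20, 0.15]
by = [0.05, 0.11, 0.07]
se_y = [0.01, 0.02, 0.015]
causal_effect, causal_se = ivw_estimate(bx, by, se_y)
```

An estimate that is clearly different from zero is compatible with a causal effect of the risk factor, while an estimate close to zero despite strong instruments argues against one.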

The fact that pleiotropy is widespread has several implications. 
One is that variants that have already been found to affect one trait 
can be prioritized in other studies since they are more likely also to 
affect another trait than a random variant is. Another implication is 
that we cannot always examine the effect of selection by studying 
one trait in isolation. There are multiple examples of antagonistic 
pleiotropy where a variant increases the risk of one disease while 
decreasing the risk of another. 


Because of differences in age of onset and severity, we do not expect 
identical allelic architectures in all common diseases. Using the 
currently available GWAS data sets, we can now start to identify 
these differences in the allelic architectures, but because of the 
significant differences in sample sizes and the number of tested
variants, this is not an easy task. 

The data available to date show that the degree of polygenicity 
differs between diseases with schizophrenia, for example, having 
more predicted loci than immune disorders [57] and hypertension 
[52]. Results also show that rare variants play a larger role in some 
diseases compared to others. Rare variants, for example, have a 
greater role in amyotrophic lateral sclerosis than in schizophrenia 
[58] and are even less important in lifestyle-dependent diseases 
such as type 2 diabetes [59]. 


The price of whole-genome sequencing is still declining, and it is 
not unreasonable to expect that at some point in the future, a 
majority of people will get their genomes sequenced. At that 
point the availability of genetic data will no longer be a limiting
factor in studies of common human diseases. In order to make the
most of such huge data sets, the genetic information needs to be 
combined with high-quality phenotypic and environmental infor- 
mation. If that is achieved, we will be able to explain most—if not 
all—of the additive genetic variance for the common human dis- 
eases. Having large population data sets where genetic data is 
combined with extensive phenotypic data including information 
about lifestyle, diet and other environmental risk factors will also 
enable much better studies of pleiotropy and gene-environment
interactions. A few large population data sets are already available,
with the UK Biobank [29], a prospective study of 500,000
individuals, being the best example.

While GWASs have found a lot of loci that are associated with 
common diseases, the actual causal variant and the functional 
mechanism driving the causation are still unknown for a large 
fraction of the loci. In order to understand the functional mecha- 
nism of a specific locus, it is necessary to combine sequence data 
with other types of data. This includes gene expression data (from 
the correct tissue) and epigenetic data such as methylation. Such 
data sets are fortunately also becoming cheaper to produce and thus 
more abundant as a result of falling sequencing costs. Furthermore 
large consortium data sets such as GTEx [60], ENCODE [61], and 
Roadmap Epigenomics [62] mean that each lab studying these 
mechanisms will not have to produce all the data themselves but 
can in part rely on these public data sets. It is thus likely that we in 
the future not only will find many more GWAS loci for each 
common disease but we will also have a much better understanding 
of how each of these loci affects the disease. 


1. How can you distinguish causal variants from other variants 
when all variants have been typed? Is there any statistical way of 
distinguishing between correlation and causality just from 
genotype data? Could you use functional annotations? 


2. Consider a GWAS data set, where in the top ten ranked statis- 
tics you have five markers that are close together and the 
remaining five scattered across the genome. Would you con- 
sider the five close markers more or less likely to be a true 
positive? Why? If one of them is a false positive, what would 
you think about the others? 


3. Why is the RR but not the OR estimate affected by a biased 
case/control sample? 


4. How would you test for, e.g., dominant or recessive effects in a 
contingency table?

Abstract


Borrowing both from population genetics and phylogenetics, the field of population genomics emerged as 
full genomes of several closely related species were available. Providing we can properly model sequence 
evolution within populations undergoing speciation events, this resource enables us to estimate key 
population genetics parameters such as ancestral population sizes and split times. Furthermore we can 
enhance our understanding of the recombination process and investigate various selective forces. With the 
advent of resequencing technologies, genome-wide patterns of diversity in extant populations have now 
come to complement this picture, offering an increasing power to study more recent genetic history. 

We discuss the basic models of genomes in populations, including speciation models for closely related 
species. A major point in our discussion is that only a few complete genomes contain much information 
about the whole population. The reason is that recombination unlinks genomic regions, and therefore a
few genomes contain many segments with distinct histories. The challenge of population genomics is to 
decode this mosaic of histories in order to infer scenarios of demography and selection. We survey modeling 
strategies for understanding genetic variation in ancestral populations and species. The underlying models 
build on the coalescent with recombination process and introduce further assumptions to scale the analyses 
to genomic data sets. 


Key words Ancestral population, Coalescence, Demography, Divergence, Markov model, Migration, 
Recombination, Selection, Speciation 


1 Introduction


We are in the population genomics era where data sets from the 
1000 human genomes project [1], the great apes project [2], and 
the 1001 arabidopsis genomes project [3] are available. The under- 
lying data sets contain genotypic information for thousands of 
individuals in one or several species, in the form of de novo 
sequenced genomes or variation compared to an available “refer- 
ence” genome (a.k.a. resequencing). By comparing genomes from 
several individuals of the same species or closely related species, we 
can obtain information about split times, population sizes, recombination
events, and selection in contemporary and ancestral

Fig. 1 Left: Isolation model of two species. Right: The coalescent process along the genomes of the two
species. By comparing the two genomes we obtain information about the split time of the species and the 
ancestral population size. Furthermore the breakpoints along the genomes correspond to recombination 
events, so we also have information about the recombination process 


species (see Fig. 1). In this chapter we discuss various models for 
obtaining this information. 

Comparing homologous sequences available for a given locus 
to infer their degree of relatedness enables the discovery of the 
parental relationships of the sequences, depicted as a tree thereby 
named genealogy. When one sequence sampled from one individual 
of one species is compared with sequences from other species, the 
resulting genealogy contains information about the history of spe- 
cies, the so-called phylogeny. The phylogeny summarizes the rela- 
tionship and the divergence times between the species. 

Conversely, when sequences from several individuals within a 
species are sampled, we have access to the genetic variation in 
contemporary populations. The evolutionary forces that shape 
genetic variation within a species are genetic drift, mutation, 
recombination, and selection and are the subject of population 
genetics. The key modeling tool in population genetics is coales- 
cent theory. Classical coalescent theory describes the genetic ances- 
try of a sample of homologous DNA sequences from the same 
species. This genealogical description includes times to common 
ancestry, which is measured back into the past. 

Molecular phylogenetics and population genetics have accu- 
mulated 50 years of methodological developments. The conver- 
gence of these two fields and their key mathematical and statistical 
tools is needed in order to fully understand genomic sequence 
alignments, because comparing genealogies and phylogenies is at 
the heart of the study of the speciation process [4]. 

We describe the interplay between population genetics and 
phylogenetics by reviewing the methods and models that have 
been developed to understand evolutionary history from genomic 
data (see Table 1 for a comparative summary of all methods).

2 Coalescent Theory and Speciation


2.1 The Standard 
Coalescent Model 


We start by describing the standard coalescent model within one 
population. The coalescent model describes the shape of the gene- 
alogy of several sequences sampled from a single population. For 
more information on the coalescent, we refer to [21, 22] and 
[23]. This section describes the coalescent process as a chronologi- 
cal process. In the next section, we will see how it can be modeled as 
a spatial process along the genome. In subsequent sections we 
extend the standard model to include two or more populations. 
In the cases where multiple populations are present we describe 
both the isolation model and the isolation-with-migration model. 


The standard coalescent model is a continuous-time approximation
of the neutral Wright-Fisher model. In the Wright-Fisher model
the number of chromosomes $2N$ (we consider diploid organisms) is
fixed in each non-overlapping generation. Each chromosome in a 
new generation chooses its ancestor uniformly at random from the 
previous generation. 

Consider two chromosomes. The probability of the two chromosomes
choosing the same ancestor is $1/(2N)$ and the probability of the two
chromosomes not finding a common ancestor is $1 - 1/(2N)$. Let $R_2$
denote the number of generations back in time when the two individuals
find a most recent common ancestor (MRCA). By repeating the argument
above, the probability of the two chromosomes not finding a common
ancestor $r$ generations back in time is

$$P(R_2 > r) = \left(1 - \frac{1}{2N}\right)^{r}.$$

If we scale time $t$ in units of $2N$, i.e., set $r = 2Nt$, we get

$$P(R_2 > r) = \left(1 - \frac{1}{2N}\right)^{2Nt} \approx e^{-t},$$

where the approximation is valid for large $N$. In coalescent time
units the waiting time $T_2 = R_2/(2N)$ before coalescence of two
individuals is therefore exponentially distributed with mean one.
These considerations can be extended to multiple individuals.
In general the time $T_n$ before two of $n$ individuals coalesce is
exponentially distributed with rate $\binom{n}{2}$.
The waiting time $W_n$ for a sample of $n$ individuals to find the
most recent common ancestor (MRCA) is given by

$$W_n = T_n + T_{n-1} + \cdots + T_2,$$

where $T_k$ are independent exponential random variables with
parameter $\binom{k}{2}$; see Fig. 2 for an illustration.

Fig. 2 Illustration of the coalescent process. The waiting time before two out of $n$ individuals coalesce is $T_n$ and
the time before a sample of $n$ individuals find common ancestry is $W_n$

It follows that the mean of $W_n$ is

$$E[W_n] = \sum_{k=2}^{n} E[T_k] = \sum_{k=2}^{n} \frac{2}{k(k-1)} = 2 \sum_{k=2}^{n} \left(\frac{1}{k-1} - \frac{1}{k}\right) = 2\left(1 - \frac{1}{n}\right).$$

Note that $\lim_{n \to \infty} E[W_n] = 2$.
The variance of $W_n$ is

$$\mathrm{Var}[W_n] = \sum_{k=2}^{n} \mathrm{Var}[T_k] = \sum_{k=2}^{n} \frac{4}{k^2 (k-1)^2} = 8 \sum_{k=1}^{n-1} \frac{1}{k^2} - 4\left(1 - \frac{1}{n}\right)\left(3 + \frac{1}{n}\right).$$

Note that $\lim_{n \to \infty} \mathrm{Var}[W_n] = 8\pi^2/6 - 12 \approx 1.16$.
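The mean and variance of the time to the MRCA are easy to check numerically. The following sketch (function names are illustrative) sums the means and variances of the independent exponential waiting times directly and compares them with the closed-form expressions.

```python
import math

def tmrca_mean(n):
    """E[W_n] as the sum of E[T_k] = 1 / C(k, 2) for k = 2..n (units of 2N generations)."""
    return sum(2.0 / (k * (k - 1)) for k in range(2, n + 1))

def tmrca_variance(n):
    """Var[W_n] as the sum of Var[T_k] = 1 / C(k, 2)^2 for k = 2..n."""
    return sum(4.0 / (k * (k - 1)) ** 2 for k in range(2, n + 1))

def tmrca_mean_closed(n):
    """Closed form 2 (1 - 1/n)."""
    return 2.0 * (1.0 - 1.0 / n)

def tmrca_variance_closed(n):
    """Closed form 8 sum_{k=1}^{n-1} 1/k^2 - 4 (1 - 1/n)(3 + 1/n)."""
    partial = sum(1.0 / k ** 2 for k in range(1, n))
    return 8.0 * partial - 4.0 * (1.0 - 1.0 / n) * (3.0 + 1.0 / n)

# Limiting variance for large samples, 8 pi^2 / 6 - 12.
limit_variance = 8.0 * math.pi ** 2 / 6.0 - 12.0

for n in (2, 5, 50, 5000):
    print(n, tmrca_mean(n), tmrca_mean_closed(n),
          tmrca_variance(n), tmrca_variance_closed(n))
```

For a sample of two, both the mean and the variance equal one, so most of the expected tree height of a large sample comes from the final coalescence of the last two lineages.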

The consequences of these calculations are that when we only 
sample within a population we are limited to relatively recent 
events. The expected time for a large sample to find their MRCA 
is approximately $2 \times (2N) = 4N$ generations with standard deviation
$\sqrt{1.16} \times (2N) \approx 2.15N$ generations. As a consequence, a
neutral sample within a population contains little information 
beyond 6N generations. 

Humans have a generation time of approximately 20 years and 
an effective population size of approximately $N = 10{,}000$ (see [21,
p. 251]), and therefore 6N generations correspond to approxi- 
mately 1.2 million years (My) for humans. Therefore human 


2.2 Adding 
Mutations to the 
Standard Coalescent 
Model 


2.3 Taking 
Recombination into 
Account

diversity at neutral loci contains little demographic information
beyond 1.2 My. 


Now suppose mutations occur at a rate $u$ per locus per generation.
In a lineage of $r$ generations, we then expect $ru$ mutations, or in
coalescent time units with $r = 2Nt$ we expect $2Ntu$ mutations. We
let $\theta = 4Nu$ be the mutation rate parameter. Since $u$ is small we can
make a Poisson approximation of the binomial number of mutations
in a lineage of $r$ generations

$$\mathrm{Bin}(r, u) = \mathrm{Bin}\left(2Nt, \theta/(2 \cdot 2N)\right) \approx \mathrm{Pois}(t\theta/2).$$


We have thus arrived at the following two-step process for
simulating samples under the coalescent: (a) simulate the genealogy
by merging lineages uniformly at random and with waiting times
exponentially distributed with rate $\binom{k}{2}$ when $k$ lineages are present;
(b) on each lineage in the tree add mutations according to a Poisson
process with rate $\theta/2$.

Another possibility is to scale the coalescent process such that
one mutation is expected in one time unit. In this case the
exponentially distributed waiting times in (a) have rate $\binom{k}{2}(2/\theta)$, and in
(b) the mutations are added with unit rate. We use the latter version
of the coalescent-with-mutations process below.
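The two-step process can be sketched in a few lines of Python. This is a minimal illustration under the first scaling (time in units of $2N$ generations, mutations at rate $\theta/2$ per lineage), not an optimized simulator; the function names, the representation of a branch as a tuple, and the parameter values are assumptions made for the example.

```python
import random

def poisson(rng, mean):
    """Poisson draw: count unit-rate exponential arrivals that fall before `mean`."""
    count = 0
    arrival = rng.expovariate(1.0)
    while arrival < mean:
        count += 1
        arrival += rng.expovariate(1.0)
    return count

def simulate_coalescent(n, theta, rng):
    """Simulate one genealogy of n sequences with mutations.

    Returns (tmrca, branches), where each branch is a tuple
    (leaves below the branch, branch length, number of mutations on it).
    """
    # Each active lineage is (set of sampled leaves it is ancestral to, start time).
    lineages = [(frozenset([i]), 0.0) for i in range(n)]
    branches = []
    time = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # (a) waiting time with rate C(k, 2), then merge two lineages at random.
        time += rng.expovariate(k * (k - 1) / 2.0)
        i, j = rng.sample(range(k), 2)
        merged_leaves = lineages[i][0] | lineages[j][0]
        for index in (i, j):
            leaves, start = lineages[index]
            length = time - start
            # (b) mutations on the finished branch: Poisson with mean theta/2 * length.
            branches.append((leaves, length, poisson(rng, theta / 2.0 * length)))
        lineages = [lin for index, lin in enumerate(lineages) if index not in (i, j)]
        lineages.append((merged_leaves, time))
    return time, branches

rng = random.Random(2024)
tmrca, branches = simulate_coalescent(n=5, theta=2.0, rng=rng)
segregating_sites = sum(mutations for _, _, mutations in branches)
```

Averaged over many replicates, the simulated TMRCA should approach $2(1 - 1/n)$, which gives a quick sanity check of the implementation.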


For species where recombination occurs, different parts of the 
genome come from distinct ancestors, and therefore have a distinct 
history. Figure 3 exemplifies this phenomenon for two species. It 
displays the genealogical relationships for two sequences which 
underwent a single recombination event. In the presence of recom- 
bination, each position of a genome alignment therefore has a 
specific genealogy, and close positions are more likely to share the 
same one (recall Fig. 1). The genome alignment can therefore be 
described as an ordered series of genealogies, spanning a variable 
amount of sites, and then changing because of a recombination 
event [4]. The genealogy is therefore depicted as a complex graph 
with nodes representing both coalescence and recombination 
events, the ancestral recombination graph (ARG, Fig. 3c). A single 
genome thus contains different samples from the distribution of the 
age of the MRCA, and the distribution contains information about 
the ancestral population size and speciation time. The coalescent 
with recombination serves as a basis for modeling genome-wide 
genealogy, a point that we will further develop in Subheading 4. 


3 Adding Genetic Barriers and Gene Flow to the Picture: The Structured Coalescent 


In this section we extend the standard coalescent model. We consider
coalescent models with multiple species and introduce population
splits or speciation events. The models that we describe are

Fig. 3 Ancestral recombination graph for two species. (a) Genealogy of four sampled sequences from two 
species. The bold line shows the divergence of two sequences of interest. (b) A single recombination event 
happened between the lineages of sequences 3 and 4 (horizontal line), so that in a part of the sequences, the 
genealogy is as depicted by the bold line and therefore displays an older divergence. (c) The corresponding 
ancestral recombination graph (in black) with the trees of each side of the recombination break point 
superimposed (red: left tree; blue: right tree). When going backward in time, a split corresponds to a 
recombination event and a merger to a coalescence event 


3.1 Isolation Model 
with Two Species 


shown in Fig. 4 (see also Table 1) and include: (a) The two species 
isolation model; (b) The two species isolation-with-migration 
models; (c) The three species isolation model (and incomplete 
lineage sorting); and (d) The three species isolation-with-migration 
model. We also discuss the general multiple species isolation-with- 
migration model. The two species isolation model was introduced 
in [24] and the isolation-with-migration model was introduced
in [25].


Fig. 4 Speciation models and associated parameters. In all exemplified models effective population size is
constant between speciation events, represented by dashed lines. The timing of the speciation events, noted
$T$, are parameters of the models, together with ancestral effective population sizes, noted $N_A$. In some cases,
contemporary population sizes can also be estimated, and are noted $N_j$, where $j$ is the index of the population.
Models with post-divergence genetic exchanges have additional migration parameters. The
number of putative migration rates increases with the number of contemporary populations under study, and
some models might consider some of them to be equal or null to reduce complexity. (a) Isolation
model with two species. (b) Isolation-migration model with two species. (c) Isolation model with three species.
(d) Isolation-migration model with three species

If the sequences are sampled from two distinct species that have
diverged a time $T$ ago (see Fig. 4a), then the distribution of the age
of the MRCA is shifted to the right by the amount $T$, resulting in
the distribution

$$f_{T_2}(t) = \begin{cases} 0 & \text{if } t < T, \\ \dfrac{2}{\theta_A} e^{-2(t - T)/\theta_A} & \text{if } t \ge T, \end{cases}$$

where $\theta_A = 4N_A u$ is the ancestral mutation rate parameter. The mean time
to coalescence is $E[T_2] = T + \theta_A/2$ and the average divergence time
between two sequences is twice this quantity, that is, $2T + \theta_A$. Since
$\theta_A = 4N_A u$ it follows that the larger the size of the ancestral
3.2 Isolation Model
with Three or More 
Species and 
Incomplete Lineage 
Sorting 


population, the bigger the difference between the speciation time 
and the divergence time. 

The variance of the divergence time is $\mathrm{Var}[T_2] = \theta_A^2/4$. With
access to the distribution of divergence times, we could estimate the 
speciation time and population size from the mean and variance of 
the distribution. Unfortunately we do not know the complete 
distribution of divergence times and it is not immediately available 
to us, because long regions are needed for precise divergence 
estimation but have experienced one or more recombination 
events. 


Now consider the isolation model with three species depicted in 
Fig. 4c. Such a model is often used for the human—chimpanzee—- 
gorilla (HCG) triplet (eg, [10-12]). 

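The density function for the time to coalescence between
sample 1 and sample 2 is given by

$$
f(t) = \begin{cases}
\dfrac{2}{\theta_{A1}}\, e^{-2(t-T_1)/\theta_{A1}} & \text{if } T_1 < t < T_{12}, \\
P_{12}\, \dfrac{2}{\theta_{A2}}\, e^{-2(t-T_{12})/\theta_{A2}} & \text{if } t > T_{12},
\end{cases} \tag{1}
$$

where

$$
T_{12} = T_1 + T_2 \quad \text{and} \quad P_{12} = e^{-2(T_{12}-T_1)/\theta_{A1}}
$$

is the probability of the two samples not coalescing in the ancestral
population of sample 1 and sample 2. In the upper right corner of
Fig. 5 we plot the density (Eq. 1) with parameters that resemble the
HCG triplet.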

If sample 1 and sample 2 do not coalesce in the ancestral
population of sample 1 and sample 2, then the three trees
((1,2),3), ((1,3),2), and ((2,3),1) are equally likely. The probability
of the gene tree being different from the species tree is thus

$$
\Pr(\text{incongruence}) = \frac{2}{3} P_{12} = \frac{2}{3}\, e^{-2(T_{12}-T_1)/\theta_{A1}}. \tag{2}
$$

The event that the gene tree is different from the species tree is
called incomplete lineage sorting (ILS). ILS is important because
species tree incongruence often manifests itself as a relatively clear
signal in a sequence alignment and thereby allows for accurate
estimation of population parameters. In Fig. 6 we show the
incongruence probability (Eq. 2). We also refer to Exercise 1 (see
Subheading 8.1) and Exercise 2 (see Subheading 8.2) for more
discussion of ILS.
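
The density (Eq. 1) and the incongruence probability (Eq. 2) are simple enough to evaluate directly. The following is a minimal sketch, not code from the cited studies; the function names are illustrative, and the numerical values are the HCG-like parameters quoted in the captions of Figs. 5 and 6.

```python
import math


def p_not_coalesced(t1, t12, theta_a1):
    """P12: probability that samples 1 and 2 do not coalesce between T1 and T12."""
    return math.exp(-2.0 * (t12 - t1) / theta_a1)


def coalescence_density(t, t1, t12, theta_a1, theta_a2):
    """Piecewise exponential density of Eq. 1 for the coalescence time of samples 1 and 2."""
    if t < t1:
        return 0.0  # no coalescence is possible before the first speciation
    if t < t12:
        return (2.0 / theta_a1) * math.exp(-2.0 * (t - t1) / theta_a1)
    p12 = p_not_coalesced(t1, t12, theta_a1)
    return p12 * (2.0 / theta_a2) * math.exp(-2.0 * (t - t12) / theta_a2)


def incongruence_probability(t1, t12, theta_a1):
    """Eq. 2: two of the three equally likely deep-coalescence trees are incongruent."""
    return (2.0 / 3.0) * p_not_coalesced(t1, t12, theta_a1)


# HCG-like values (assumed here, taken from the figure captions)
T1, T12, THETA_A1, THETA_A2 = 0.0038, 0.0062, 0.0062, 0.0033
print(incongruence_probability(T1, T12, THETA_A1))  # close to the 30% quoted for Fig. 6
```

The incongruence probability depends on the parameters only through the ratio $(T_{12} - T_1)/\theta_{A1}$, which is why Fig. 6 can be drawn against a single axis.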
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Fig. 5 Illustration of the density for coalescent in various models and data layout. The curves are the 
probability density functions. In the most simple case with two species, a constant ancestral population 
size and a punctual speciation (top left panel), more genomic regions find a common ancestor close to the 
species split (the vertical line), while a few regions have a more ancient common ancestor, distributed in an 
exponential manner (see Eq. 1). If speciation is not punctual and migration occurred after isolation of the 
species, then some sequences have a common ancestor which is more recent than the species split and the 
distribution in the ancestor becomes more complex (bottom left panel, see Eqs. 4 and 6). When a third species 
is added (right panel), then another discontinuity appears and all distributions depend on additional
parameters, particularly when migration is allowed. We use $\theta_{A1} = 0.0062$, $\theta_{A2} = 0.0033$ and $\tau_1 = 0.0038$ (the
first vertical line), $\tau_{12} = 0.0062$ (the second vertical line) corresponding to the HCG triplet. Ancestral population
sizes are taken from the simulation study in Table 6 in Wang and Hey [8]: $\theta_1 = 0.005$ and $\theta_2 = 0.003$.
Migration parameters are all set to 50


In the three-species isolation model the mean coalescence time
for a sample from population 1 and a sample from population 2 is
given by

$$
E[T] = T_1 + (1 - P_{12})\, \frac{\theta_{A1}}{2} + P_{12}\, \frac{\theta_{A2}}{2}. \tag{3}
$$
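
As a sanity check, the closed form of Eq. 3 can be compared with a direct numerical integration of $t f(t)$, where $f$ is the density of Eq. 1. This is a sketch under stated assumptions: the function names, the midpoint rule, and the truncation point of the integral are choices made for illustration only.

```python
import math


def mean_coalescence_time(t1, t12, theta_a1, theta_a2):
    """Closed form of Eq. 3."""
    p12 = math.exp(-2.0 * (t12 - t1) / theta_a1)
    return t1 + (1.0 - p12) * theta_a1 / 2.0 + p12 * theta_a2 / 2.0


def density(t, t1, t12, theta_a1, theta_a2):
    """Piecewise exponential density of Eq. 1."""
    if t < t1:
        return 0.0
    if t < t12:
        return (2.0 / theta_a1) * math.exp(-2.0 * (t - t1) / theta_a1)
    p12 = math.exp(-2.0 * (t12 - t1) / theta_a1)
    return p12 * (2.0 / theta_a2) * math.exp(-2.0 * (t - t12) / theta_a2)


def mean_by_quadrature(t1, t12, theta_a1, theta_a2, steps=20000):
    """Midpoint-rule approximation of the integral of t * f(t)."""
    upper = t12 + 20.0 * theta_a2  # the tail mass beyond this point is about exp(-40)
    total = 0.0
    for lo, hi in ((t1, t12), (t12, upper)):  # integrate each smooth piece separately
        h = (hi - lo) / steps
        for i in range(steps):
            t = lo + (i + 0.5) * h
            total += t * density(t, t1, t12, theta_a1, theta_a2) * h
    return total


params = (0.0038, 0.0062, 0.0062, 0.0033)  # HCG-like values from the Fig. 5 caption
print(mean_coalescence_time(*params), mean_by_quadrature(*params))
```

Splitting the integral at $T_{12}$ keeps the midpoint rule accurate, because the density is discontinuous there.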
Burgess and Yang [9] describe the speciation process for
human, chimpanzee, gorilla, orangutan (O), and macaques
(M) using an isolation model with five species. The HCGOM
model contains four ancestral parameters $\theta_{HC}$, $\theta_{HCG}$, $\theta_{HCGO}$, and
$\theta_{HCGOM}$. In this case Eq. 3 extends to

$$
\begin{aligned}
E[T] = {} & \tau_{HC} + (1 - P_{HC})\, \frac{\theta_{HC}}{2} + P_{HC}(1 - P_{HCG})\, \frac{\theta_{HCG}}{2} \\
& + P_{HC} P_{HCG} (1 - P_{HCGO})\, \frac{\theta_{HCGO}}{2} \\
& + P_{HC} P_{HCG} P_{HCGO} (1 - P_{HCGOM})\, \frac{\theta_{HCGOM}}{2}.
\end{aligned}
$$

Fig. 6 Probability (Eq. 2) of gene tree and species tree being incongruent. In case
of the HCG triplet we obtain $(T_{12} - T_1)/\theta_{A1} = (0.0062 - 0.0038)/0.0062 = 0.39$,
which corresponds to an incongruence probability of 30%


3.3 Isolation-with-Migration Model with Two Species and Two Samples

The isolation-with-migration (IM) model with two species is
shown in Fig. 4b. The IM model has six parameters: the mutation
rates $\theta_1$, $\theta_2$, and $\theta_A$, the migration rates $m_1$ and $m_2$, and the
speciation time $T$. We let $\Theta = (\theta_1, \theta_2, \theta_A, m_1, m_2, T)$ be the vector
of parameters.

Wang and Hey [8] consider a situation with two genes. Before
time $T$ the system is in one of the following five states:

$S_{11}$: Both genes are in population 1.
$S_{22}$: Both genes are in population 2.
$S_{12}$: One gene is in population 1 and the other is in population 2.
$S_1$: The genes have coalesced and the single gene is in population 1.
$S_2$: The genes have coalesced and the single gene is in population 2.

The instantaneous rate matrix $Q$ is given by

$$
Q = \begin{array}{c|ccccc}
 & S_{11} & S_{12} & S_{22} & S_1 & S_2 \\ \hline
S_{11} & \cdot & 2m_1 & 0 & 2/\theta_1 & 0 \\
S_{12} & m_2 & \cdot & m_1 & 0 & 0 \\
S_{22} & 0 & 2m_2 & \cdot & 0 & 2/\theta_2 \\
S_1 & 0 & 0 & 0 & \cdot & m_1 \\
S_2 & 0 & 0 & 0 & m_2 & \cdot
\end{array}
$$

where each diagonal entry (shown as $\cdot$) is such that the row sums to zero.
Starting in state $a$, the density for coalescence in population 1 at time
$t < T$ is given by [26]

$$
f_1(t) = (e^{Qt})_{a S_{11}}\, (2/\theta_1), \tag{4}
$$

the density for coalescence in population 2 at time $t < T$ is

$$
f_2(t) = (e^{Qt})_{a S_{22}}\, (2/\theta_2), \tag{5}
$$

and the total density for a coalescence at time $t < T$ is

$$
f(t) = f_1(t) + f_2(t). \tag{6}
$$

Here $e^{A} = \sum_{i=0}^{\infty} A^i/i!$ is the matrix exponential of the matrix
$A$ and $(e^{A})_{ij}$ is entry $(i, j)$ in the matrix exponential.

After time $T$ the system only has two states: $S_{AA}$ corresponding
to two genes in the ancestral population and $S_A$ corresponding to
one single gene in the ancestral population. The rate of going from
state $S_{AA}$ to state $S_A$ is $2/\theta_A$. The density for coalescence in the
ancestral population at time $t > T$ is therefore

$$
f(t) = \left[ (e^{QT})_{a S_{11}} + (e^{QT})_{a S_{12}} + (e^{QT})_{a S_{22}} \right] \frac{2}{\theta_A}\, e^{-2(t-T)/\theta_A}. \tag{7}
$$

In Fig. 5 we illustrate the coalescent density in the two-species
isolation-with-migration model.
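
The densities in Eqs. 4-7 only require the matrix exponential of a 5 x 5 rate matrix, so they can be sketched in plain Python. The code below is illustrative rather than the implementation of Wang and Hey [8]: it assumes that $m_i$ is the backward-in-time rate at which a lineage leaves population $i$, uses a hand-written scaling-and-squaring Taylor series for the matrix exponential, and takes parameter values loosely inspired by the caption of Fig. 5 (the value used for $\theta_A$ is an arbitrary choice).

```python
import math


def mat_mul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(a, terms=20):
    """Matrix exponential via scaling and squaring of a truncated Taylor series."""
    n = len(a)
    norm = max(sum(abs(x) for x in row) for row in a)
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scaled = [[x / 2.0 ** squarings for x in row] for row in a]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def im_rate_matrix(theta1, theta2, m1, m2):
    """Rate matrix Q with states ordered S11, S12, S22, S1, S2 (assumed convention)."""
    q = [[0.0, 2 * m1, 0.0, 2 / theta1, 0.0],
         [m2, 0.0, m1, 0.0, 0.0],
         [0.0, 2 * m2, 0.0, 0.0, 2 / theta2],
         [0.0, 0.0, 0.0, 0.0, m1],
         [0.0, 0.0, 0.0, m2, 0.0]]
    for i in range(5):
        q[i][i] = -sum(q[i])  # diagonal makes each row sum to zero
    return q


def im_coalescence_density(theta1, theta2, m1, m2, split_time, start=1, steps=400):
    """Tabulate f(t) of Eq. 6 on [0, T]; also return P(no coalescence before T)."""
    q = im_rate_matrix(theta1, theta2, m1, m2)
    h = split_time / steps
    step = mat_exp([[x * h for x in row] for row in q])  # transition matrix over time h
    probs = [0.0] * 5
    probs[start] = 1.0  # start=1 is S12: one gene sampled in each population
    times, dens = [], []
    for k in range(steps + 1):
        times.append(k * h)
        dens.append(probs[0] * 2 / theta1 + probs[2] * 2 / theta2)  # Eqs. 4 and 5
        if k < steps:
            probs = [sum(probs[i] * step[i][j] for i in range(5)) for j in range(5)]
    return times, dens, probs[0] + probs[1] + probs[2]


def ancestral_density(t, uncoalesced, theta_a, split_time):
    """Eq. 7 for t > T, given the probability of not having coalesced by T."""
    return uncoalesced * (2 / theta_a) * math.exp(-2 * (t - split_time) / theta_a)


times, dens, uncoalesced = im_coalescence_density(0.005, 0.003, 50.0, 50.0, 0.0038)
print(uncoalesced, ancestral_density(0.005, uncoalesced, 0.0062, 0.0038))
```

Propagating the state probabilities with a single step matrix avoids recomputing the matrix exponential at every grid point. The mass of the density on $[0, T]$ plus the probability of not having coalesced by $T$ should sum to one, which gives a convenient check on any implementation.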


The likelihood for a pair of homologous sequences $X$ is given
by

$$
P(X \mid \Theta) = L(\Theta \mid X) = \int_0^{\infty} P(X \mid t)\, f(t \mid \Theta)\, dt, \tag{8}
$$

where $f(t) = f(t \mid \Theta)$ given by Eqs. 6 and 7 is the density of the two
sequences finding a MRCA at time $t$ and $P(X \mid t)$ is the probability of
the two sequences given that they find a MRCA at time $t$. The latter
term is calculated using a distance-based method. One possibility is
to use the infinite sites model where it is assumed that substitutions
happen at unique sites, i.e., there are no recurrent substitutions. In
this case the number of differences between the two sequences
follows a Poisson distribution with a rate proportional to $t$.

For an application of the isolation-with-migration model with
two sequences, we refer to [8]; a discussion of their approach can be
found in [27].


3.4 Isolation-with-Migration Model with Three or More Species and Three or More Samples

Hey [28] considered the multipopulation isolation-with-migration
(IM) model. Recall from Fig. 4b that the two-population IM model
has six parameters: two present population sizes, one ancestral
population size, one speciation time, and two migration rates.
The three-population IM model in Fig. 4d has fifteen parameters:
three present population sizes, two ancestral population sizes, two
speciation times, and eight migration rates. In general a
$k$-population IM model has $3k - 2 + 2(k - 1)^2$ parameters:

• $k$ present population sizes,
• $(k - 1)$ ancestral population sizes,
• $(k - 1)$ speciation times, and
• $2(k - 1)^2$ migration rates.

See Fig. 5 for an example of divergence distribution with three
species and migration and Exercise 3 (see Subheading 8.3) for a
derivation of the number of migration rates in the general
$k$-population model. For $k$ = 5, 6, and 7 we obtain 45, 66, and 91
parameters. Because the number of parameters becomes very large even
for small $k$, Hey [28] suggests adding constraints to the migration
rates, e.g., setting some rates to zero or introducing symmetry
conditions where rates between populations are the same.
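
The parameter count can be tabulated with a few lines of code. This is a small illustrative helper (the function names are not from Hey [28]) that simply encodes the four bullet points above.

```python
def im_parameter_count(k):
    """Break down the parameters of a k-population IM model by type."""
    if k < 2:
        raise ValueError("an IM model needs at least two populations")
    return {
        "present_sizes": k,
        "ancestral_sizes": k - 1,
        "speciation_times": k - 1,
        "migration_rates": 2 * (k - 1) ** 2,
    }


def total_parameters(k):
    """Total number of parameters, equal to 3k - 2 + 2(k - 1)^2."""
    return sum(im_parameter_count(k).values())


for k in (2, 3, 5, 6, 7):
    print(k, total_parameters(k))
```

The quadratic growth comes entirely from the migration rates, which is why constraining them is the natural way to keep the model tractable.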


4 Approximating the Coalescent with Recombination Along Genomes 


Before the genomic era, multilocus population genetics models 
were addressing a small fraction of the complete ancestral recombi- 
nation graph (ARG) by considering independent loci. As sequenc- 
ing technologies evolved and allowed access to larger samples of 
genomic diversity, this independence assumption had to be relaxed 
and more explicit modeling of the ARG was required. Yet the 
complexity of the coalescent with recombination process makes its 
application to genome-scale data sets very challenging. Two direc- 
tions of analysis methods have emerged: simulation-based or spatial 
approximations along the genome. In this chapter we focus on the 
latter and refer to Kelleher et al. [29] and Staab et al. [30] for the 
former. Simonsen and Churchill [31] described the first model of 
the joint distribution of genealogies at two loci for two genomes. 
Wiuf and Hein [32] extended this approach and described the
coalescent as a spatial process along the genome. McVean and
Cardin [33] further approximated the description with a Markov
process. In this section we describe and discuss these types of
approximations.


4.1 The Independent Loci Approach: Free Recombination Between, No Recombination Within

The simplest way to handle issues relating to the ancestral
recombination graph is to divide the data into presumably independent
loci. Such analyses are therefore restricted to candidate regions that
are not too large (to avoid including a recombination point) and
not too close (to ensure several recombination events happened
between loci). Each region can then be described by a single
underlying tree, reducing the analytical and computational load.

Using 15,000 loci separated by 10 kb and totaling 7.4 Mb, together
with the isolation model introduced above, Burgess and Yang [9]
(Table 2, model (b) sequencing errors) find the following estimates
of ancestral population sizes and speciation times for human (H),
chimpanzee (C), gorilla (G), orangutan (O), and macaque (M) ancestors:
$\theta_{HC} = 0.0062$, $\theta_{HCG} = 0.0033$, $\theta_{HCGO} = 0.0061$,
$\theta_{HCGOM} = 0.0118$ and $\tau_{HC} = 0.0038$, $\tau_{HCG} = 0.0062$,
$\tau_{HCGO} = 0.0137$, $\tau_{HCGOM} = 0.0260$. Converting these estimates into
time units requires an estimate of the substitution rate, either
absolute or deduced from a calibration point. Using $u = 10^{-9}$ as an
estimate for substitutions per site per year, this leads to an estimate
of 3.8 My for the human-chimpanzee speciation, a very recent estimate.
Using the same data, Yang [10] showed that the isolation-with-migration
model was preferred. Yang finds a more ancient speciation time
$\tau_{HC} = 0.0053$ (5.3 My with $u = 10^{-9}$) when migration
is accounted for.


4.2 State-Space Model: Simonsen-Churchill Framework

The coalescent with recombination for two loci and two sequences 
is originally described in Simonsen and Churchill [31] as a 
continuous-time Markov chain backward in time with eight states 
as shown in Fig. 7. This Markov chain is given a careful treatment in 
the textbooks by Durrett [34, Section 3.1.1] and Wakeley [21, 
Section 7.2.4], and we therefore only briefly explain the basic 
properties of the model here. 

A single sequence is either linked (●-●, ×-●, ●-×, or ×-×),
meaning that it contains material ancestral to the sample at both
loci, or unlinked (●-, -●, -×, or ×-), meaning that it contains
material ancestral to the sample at only one locus. The coalescent
rate is one for any two sequences, and the recombination rate is
$\rho/2$ for any linked sequence. The chain begins at time zero in
state 1 with two linked sequences. After an exponential waiting time
with rate $1 + \rho$ the chain enters state 8 with probability
$1/(1 + \rho)$ or state 2 with probability $\rho/(1 + \rho)$. The
transition from state 1 to state 8 is a coalescent event, and the
left and right tree heights are identical. The transition from state 1
to state 2 is a recombination event that breaks apart one of the two
sequences. All other transitions have similar interpretations. Common
ancestry for a locus is marked with a ×, so the transition from, e.g.,
state 1 to state 8 is a transition to the state ×-×.

The height $S$ of the left tree is the first time at which the process
enters one of the states 5, 7, or 8 (states with a left ×), and the
height $T$ of the right tree is the first time at which one of the states
4, 6, or 8 is entered (states with a right ×). When state 8 is entered
from state 1 the two tree heights are identical. State 8 is absorbing
because only the tree heights are of interest.

Fig. 7 State transition diagram for two loci and two sequences described as a continuous-time Markov chain
backward in time. The figure is adapted from Figure 7.7 in Wakeley [21]. A line with a bullet or a cross at both
ends is a linked sequence (ancestral material to the sample at both loci), whereas a line with a bullet or a cross
at one end only is a sequence with ancestral material at one locus only. A cross denotes common ancestry.
$s$ and $t$ denote the heights of the left and right trees, respectively

The two key ingredients for the state-space model are the
conditional probability $P(T = s \mid S = s)$ of staying in a state and
the conditional density $q(t \mid s)$ of a new tree height $t$ conditional on
a change and a previous tree height $s$. Hobolth and Jensen [35]
show that the conditional probability of no change from the left to
the right tree is

$$
P(T = s \mid S = s) = e^{s}\, [e^{\Lambda s}]_{11} \tag{9}
$$

and the conditional density $q(t \mid s)$ of $T$ given $S = s$ and given
$T \neq S$ is

$$
q(t \mid s) = \begin{cases}
\dfrac{\left([e^{\Lambda t}]_{12} + [e^{\Lambda t}]_{13}\right) e^{-(s-t)}}{e^{-s} - [e^{\Lambda s}]_{11}} & t < s, \\[2ex]
\dfrac{\left([e^{\Lambda s}]_{12} + [e^{\Lambda s}]_{13}\right) e^{-(t-s)}}{e^{-s} - [e^{\Lambda s}]_{11}} & t > s,
\end{cases} \tag{10}
$$

where $\Lambda$ denotes the 8 × 8 rate matrix from Fig. 7.

Wakeley [21, Section 7.2.4] noted that the transitions between
state 4 and 6 and the transitions between state 5 and 7 can be
removed from the chain if we are only interested in the tree heights.
Actually, even more transitions can be removed from the chain.
Note from Eqs. 9 and 10 that we only need the entries (1, 1), (1, 2),
and (1, 3) in $e^{\Lambda t}$ for calculating the probability of the same tree
height in the next position and the transition density conditional on
a change. These entries can be found from a reduced rate matrix
where states 4, 5, 6, and 7 are removed and the rate from states
2 and 3 to a new absorbing state equals 2. In other words, define
the reduced rate matrix

$$
\tilde{\Lambda} = \begin{pmatrix}
-(1+\rho) & \rho & 0 & 1 \\
1 & -(3+\rho/2) & \rho/2 & 2 \\
0 & 4 & -6 & 2 \\
0 & 0 & 0 & 0
\end{pmatrix},
$$

where states are numbered 1, 2, 3, and 4. The holding probability and
transition density for the model are now given by Eqs. 9 and 10
with $\Lambda$ substituted by $\tilde{\Lambda}$.

In the left plot in Fig. 8 we illustrate the probability (Eq. 9) of
the same tree height in the left and right loci conditional on the
tree height in the left locus and for different recombination rates.
As expected the probability for identical tree heights decreases with
the height of the left tree and with the recombination rate.

In the right plot in Fig. 8 we illustrate the density (Eq. 10) of
the right tree height conditional on the left tree height and a
change in tree height. When the recombination rate increases, the
density for the right tree height moves toward smaller tree heights.
The reason is that at least one recombination is needed for having a
change in tree height. We also observe that the density is continuous
but not differentiable at the position of the left tree height.

Fig. 8 (a) Probability of same tree height. (b) Density for right tree height conditional on the left tree height
being equal to $s$ and a recombination rate equal to $\rho$
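
The quantities plotted in Fig. 8 can be reproduced from the reduced 4 x 4 rate matrix alone. The sketch below is an illustration under stated assumptions, not the code of Hobolth and Jensen [35]: the matrix exponential is a hand-written scaling-and-squaring Taylor series, and all function names are hypothetical.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(a, terms=20):
    """Matrix exponential via scaling and squaring of a truncated Taylor series."""
    n = len(a)
    norm = max(sum(abs(x) for x in row) for row in a)
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scaled = [[x / 2.0 ** squarings for x in row] for row in a]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def reduced_rate_matrix(rho):
    """Reduced chain: states 1-3 are transient, state 4 is absorbing."""
    return [[-(1 + rho), rho, 0.0, 1.0],
            [1.0, -(3 + rho / 2), rho / 2, 2.0],
            [0.0, 4.0, -6.0, 2.0],
            [0.0, 0.0, 0.0, 0.0]]


def transient_probs(rho, t):
    """Entries (1,1), (1,2), (1,3) of the matrix exponential at time t."""
    p = mat_exp([[x * t for x in row] for row in reduced_rate_matrix(rho)])
    return p[0][0], p[0][1], p[0][2]


def prob_same_height(s, rho):
    """Eq. 9: probability that the right tree height equals the left height s."""
    p11, _, _ = transient_probs(rho, s)
    return math.exp(s) * p11


def height_density(t, s, rho):
    """Eq. 10: density of the right height t given left height s and a change."""
    p11_s, p12_s, p13_s = transient_probs(rho, s)
    denom = math.exp(-s) - p11_s
    if t < s:
        _, p12_t, p13_t = transient_probs(rho, t)
        return (p12_t + p13_t) * math.exp(-(s - t)) / denom
    return (p12_s + p13_s) * math.exp(-(t - s)) / denom


print(prob_same_height(1.0, 0.5), height_density(0.5, 1.0, 0.5))
```

With $\rho = 0$ no recombination can occur, so `prob_same_height` returns one for every $s$ and Eq. 10 is undefined (its denominator vanishes); the density should therefore only be evaluated for $\rho > 0$.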


4.3 Time Discretization: Setting Up the Finite State HMM

Li and Durbin [14] and Mailund et al. [13] analyze pairs of
sequences using a hidden Markov model (HMM). The hidden
states are tree heights (times to the most recent common ancestor),
and the tree height is discretized to obtain a finite hidden state
space. The observed states of the HMM are alignment columns,
with probabilities corresponding to a substitution process on the
tree (see Fig. 9). In the Li and Durbin model, an infinite sites model
is assumed and observed states are converted to binary data, telling
whether the site is heterozygous (one mutation) or homozygous
(no mutation).

We now describe how we discretize time for the case of two 
sequences considered in the previous section. The discrete version 
of the Markov process is used to build a finite Markov chain along 
the two sequences. When the finite Markov chain is combined with 
a substitution process, we obtain an HMM as in Li and 
Durbin [14]. 

Let the discrete time points (backward in time) of the Markov
chain be $d_0 = 0 < d_1 < d_2 < \cdots < d_{M-1} < d_M = \infty$ and denote
the corresponding states by 1, 2, ..., $M$. State $m$ ($m \in \{1, \ldots, M\}$)
then corresponds to a tree height in the interval between $d_{m-1}$ and
$d_m$. The continuous stationary distribution is $\pi(t) = \exp(-t)$, and
therefore the discrete times are chosen such that
$1 - \exp(-d_m) = m/M$, or $d_m = -\log(1 - m/M)$, where we define
$\log(0) = -\infty$.

Fig. 9 (a) Graphical structure of the hidden Markov model. (b) Simulation from the hidden Markov model

We now get for $1 \le \ell, r \le M$ the joint distribution

$$
P(L = \ell, R = r) = \begin{cases}
\displaystyle\sum_{i \in \{1,2,3\}} \sum_{j,k \in \{5,7\}} [e^{\Lambda d_{\ell-1}}]_{1i}\, [e^{\Lambda (d_\ell - d_{\ell-1})}]_{ij}\, [e^{\Lambda (d_{r-1} - d_\ell)}]_{jk}\, [e^{\Lambda (d_r - d_{r-1})}]_{k8} & \text{if } \ell < r, \\[2ex]
\displaystyle\sum_{i \in \{1,2,3\}} [e^{\Lambda d_{\ell-1}}]_{1i}\, [e^{\Lambda (d_\ell - d_{\ell-1})}]_{i8} & \text{if } \ell = r, \\[2ex]
P(L = r, R = \ell) & \text{if } \ell > r.
\end{cases} \tag{11}
$$

The reason for the first case is that in order for the left tree height to
be in state $\ell < r$, the chain must be in state 1, 2, or 3 at time
$d_{\ell-1}$ and in state 5 or 7 at time $d_\ell$ (i.e., there have been no
coalescent events before time $d_{\ell-1}$ and a left coalescent event
between time $d_{\ell-1}$ and $d_\ell$), and similarly it must still be in
state 5 or 7 at time $d_{r-1}$ and in state 8 at time $d_r$ (i.e., there
have been no coalescent events between time $d_\ell$ and time $d_{r-1}$ and
a right coalescent event between time $d_{r-1}$ and time $d_r$). The next
case corresponds to no coalescent events before time $d_{\ell-1}$ and both
a left and a right coalescent event between time $d_{\ell-1}$ and $d_\ell$.
The last case is due to symmetry of the chain.

From the joint tree states $(\ell, r)$ we easily get the conditional
tree states

$$
P_{\ell r} = P(r \mid \ell) = P(R = r \mid L = \ell) = \frac{P(L = \ell, R = r)}{P(L = \ell)},
$$

where $P(L = \ell) = \sum_r P(R = r, L = \ell)$. These probabilities are used
in the HMM.


4.4 Careful Treatment of Mutation Process

A careful treatment of the mutation process allows for a
coarser binning procedure and is needed to avoid biasing the results.
In continuous time the probability for a mutation given a tree
height $t$ is given by $\mu(t) = 1 - \exp(-\theta t)$, and the stationary tree
height distribution is $\pi(t) = \exp(-t)$. The probability of a mutation
conditional on the hidden state $m$ becomes

$$
\begin{aligned}
\mu_m &= P(Y_i = 1 \mid L_i = m) = P(Y_i = 1 \mid t \in (d_{m-1}, d_m)) = \frac{P(Y_i = 1,\, t \in (d_{m-1}, d_m))}{P(t \in (d_{m-1}, d_m))} \\[1ex]
&= \frac{\int_{d_{m-1}}^{d_m} P(Y_i = 1 \mid t)\, \pi(t)\, dt}{\int_{d_{m-1}}^{d_m} \pi(t)\, dt}
 = \frac{\int_{d_{m-1}}^{d_m} (1 - e^{-\theta t})\, e^{-t}\, dt}{\int_{d_{m-1}}^{d_m} e^{-t}\, dt} \\[1ex]
&= 1 - e^{-\theta d_{m-1}}\, \frac{1 - e^{-(1+\theta)(d_m - d_{m-1})}}{(1 + \theta)\left(1 - e^{-(d_m - d_{m-1})}\right)}.
\end{aligned} \tag{12}
$$

Note that with a fine discretization the interval $d_m - d_{m-1}$ is small
and the first-order Taylor expansion $\exp(-az) \approx 1 - az$ for $z$ small
gives

$$
P(Y_i = 1 \mid L_i = m) \approx 1 - e^{-\theta d_{m-1}},
$$

as perhaps expected. We are, however, discretizing the interval
$[0, \infty)$, so it is not possible to avoid one or more large bins.
Generally we have found that a careful treatment of the mutation
process is crucial for accurate inference [36].


4.5 Statistical Inference of Population Parameters from Sequences

Here we choose to focus on three inference methods for estimating
the recombination rate. The first method is based on the full
likelihood obtained from the classical forward (or backward)
algorithm for HMMs. The second is based on the distribution of the
distance between segregating sites. This summary statistic was
used in Harris and Nielsen [37] for demographic inference. It is
sometimes also described as the distribution of the distance
between heterozygous sites, runs of homozygosity, or the
nearest-neighbor distribution. The third method is based on the
probability that two sites a certain distance apart are both
heterozygous. This probability is closely related to the pair
correlation function from spatial statistics [36] and to the zygosity
correlation introduced in [38].


4.5.1 Summary Statistics: Runs of Homozygosity and Pair Correlation

Recall that in continuous time the probability for a mutation given
a tree height $t$ is given by $\mu(t) = 1 - \exp(-\theta t)$, and the stationary
tree height distribution is $\pi(t) = \exp(-t)$. The marginal probability
for a mutation is therefore given by

$$
\int_0^{\infty} \mu(t)\, \pi(t)\, dt = \theta/(1 + \theta). \tag{13}
$$
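
The time discretization, the bin-wise mutation probabilities of Eq. 12, and the marginal probability of Eq. 13 fit together neatly: because every bin has stationary probability $1/M$, the average of the $\mu_m$ must equal $\theta/(1 + \theta)$ exactly. The following is a minimal sketch of that check; the function names and the choice $M = 10$ are illustrative assumptions.

```python
import math


def time_boundaries(num_states):
    """Boundaries d_0 = 0 < d_1 < ... < d_M = inf with 1 - exp(-d_m) = m / M."""
    return [-math.log(1.0 - m / num_states) if m < num_states else math.inf
            for m in range(num_states + 1)]


def mutation_probability(theta, lo, hi):
    """Eq. 12: probability of a mutation given a tree height in the bin (lo, hi)."""
    num = math.exp(-(1 + theta) * lo) - math.exp(-(1 + theta) * hi)
    den = math.exp(-lo) - math.exp(-hi)
    return 1.0 - num / ((1 + theta) * den)


def emission_probabilities(theta, num_states):
    """Vector (mu_1, ..., mu_M) of mutation probabilities, one per hidden state."""
    d = time_boundaries(num_states)
    return [mutation_probability(theta, d[m - 1], d[m])
            for m in range(1, num_states + 1)]


theta, M = 0.1, 10
mu = emission_probabilities(theta, M)
print(sum(mu) / M, theta / (1 + theta))  # both sides of Eq. 13

# The naive approximation 1 - exp(-theta * d_{m-1}) is poor in the last, unbounded bin
d = time_boundaries(M)
print(mu[-1], 1.0 - math.exp(-theta * d[M - 1]))
```

The last line illustrates why the careful treatment matters: the final bin extends to infinity, so evaluating the mutation probability at its lower boundary underestimates the true bin average.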


We also get the stationary distribution

$$
\varphi(t) = \frac{\mu(t)\, \pi(t)}{\int_0^{\infty} \mu(t)\, \pi(t)\, dt} = \frac{1 + \theta}{\theta}\, (1 - e^{-\theta t})\, e^{-t}
$$

for a tree height $t$ conditional on a mutation. Figure 10a shows $\varphi(t)$
for different values of $\theta$. Note that small mutation rates imply a
higher tree height when we condition on a mutation. In discrete
time the probability for a mutation given a tree height $m$ was given
by Eq. 12. Let $\mu = (\mu_1, \ldots, \mu_M)$ be the vector of mutation
probabilities. The stationary distribution $\varphi = (\varphi_1, \ldots, \varphi_M)$ for a
state $m$ conditional on a mutation is given by

$$
\varphi_m = \frac{\mu_m \pi_m}{\sum_{m'=1}^{M} \mu_{m'} \pi_{m'}},
$$

where $\pi_m = 1/M$ because this is how the time discretization was
chosen.

The probability for a mutation at a distance $r$ from a typical
mutation is then given by

$$
\nu(r) = \varphi' P^{r} \mu,
$$

where $'$ denotes vector transpose. In Fig. 10b we show $\nu(r)$ as a
function of $\rho$ and $\theta$. Note that the curves converge to $\theta/(1 + \theta)$ and
that the behavior for small $r$ is determined by the recombination rate.

The distribution of runs of homozygosity is given by

$$
\psi(r) = \varphi' \left[ P\, \mathrm{diag}(e - \mu) \right]^{r-1} P \mu.
$$

Here $e = (1, \ldots, 1)$ is the vector of length $M$ with 1 in every entry
and $\mathrm{diag}(e - \mu)$ is the diagonal matrix with $e - \mu$ on the diagonal.
In Fig. 10c we show $\psi(r)$ as a function of $\rho$ and $\theta$.

Fig. 10 (a) Stationary distribution of tree height conditional on a mutation. (b) Probability of a mutation at
various distances away from a mutation. (c) Probability of the first mutation at various distances away from a
mutation


4.5.2 Parameter Estimation

We estimate the mutation rate using an estimating equation based
on the marginal probability for a mutation (Eq. 13). If the observed
frequency of mutations is $\hat{p}$, then the estimated mutation rate is
$\hat{\theta} = \hat{p}/(1 - \hat{p})$ (see left plot in Fig. 11). The recombination
rate is estimated using maximum likelihood for the HMM and goodness
of fit for the pair correlation (see middle plot in Fig. 11) and runs of
homozygosity (see right plot in Fig. 11).
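
The estimating equation for the mutation rate amounts to inverting Eq. 13. A minimal sketch follows; the function name and the example counts are illustrative and do not come from the simulation study.

```python
def estimate_theta(num_mutations, num_sites):
    """Invert p = theta / (1 + theta) at the observed mutation frequency."""
    p_hat = num_mutations / num_sites
    if not 0.0 <= p_hat < 1.0:
        raise ValueError("the mutation frequency must lie in [0, 1)")
    return p_hat / (1.0 - p_hat)


# With theta = 0.1 we expect about 0.1 / 1.1 * 20000 = 1818 mutations in 20,000 sites
print(estimate_theta(1818, 20000))
```

Because the estimator only uses the overall frequency of heterozygous sites, it does not depend on the recombination rate, which is why the two parameters can be estimated in separate steps.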

We simulated 50 sequences of length 20,000 base pairs
with mutation rate $\theta = 0.1$ and recombination rate $\rho = 0.1$. We
estimated the mutation rate using the estimating equation and the
recombination rate using maximum likelihood and the HMM, and
goodness of fit for the pair correlation and nearest neighbor
(Fig. 12) [35]. As expected the HMM procedure shows the best
results because here we are using all the available information. It
seems, however, that we are not losing too much power when
applying the pair correlation function. This is in contrast to the
nearest-neighbor summary statistic, which performs much worse than
the other two methods.

We have provided a detailed treatment of the main components
involved in an analysis of a pair of DNA sequences based on an HMM
derived from coalescent theory. Pairwise sequentially Markov
coalescent (PSMC) models have been extensively applied to various
organisms; see, for instance, [39-43].

Fig. 11 Parameter estimation for summary statistics. (a) The mutation rate $\theta$ is estimated from the observed
number of mutations and length of the region. (b) The recombination rate $\rho$ is estimated using the empirical
distribution of a mutation at various distances from a mutation. (c) The recombination rate is estimated using
the empirical distribution of the first mutation from a mutation

Fig. 12 Results of parameter estimation for simulation study. The pair correlation summary performs rather
well compared to the full HMM data analysis. Nearest neighbor is a poor summary statistic


5 Extending the Pairwise Sequentially Markov Coalescent


Extending the SMC to more than two genomes has proved to be
challenging. The number of hidden states becomes prohibitive, as
several divergence times have to be modeled and combined with
distinct possible topologies. Further simplifications are therefore
needed to account for an increasing number of genomes.


5.1 From 2 to n Genomes

5.1.1 The Multiple Sequentially Markov Coalescent (MSMC)

Schiffels and Durbin [15] proposed to extend the PSMC model
[14] to more than two haploid genomes by modeling the most
recent coalescence event in the sample. In this framework, the
hidden states of the model are a combination of divergence times,
taken from a discretized distribution, and the identity of the
corresponding haplotypes involved. The rationale for such a
simplification was that the PSMC showed poor resolution in the recent
past [14], and considering more genomes would bring additional
signal. The drawback of this implementation is that the more
genomes are considered, the more the timeframe in which population
parameters can be inferred is shifted toward the present. As a
result, the authors reported that with more than 8 diploid
individuals (16 haploid genomes), parameters can virtually not be
estimated (see also [44] for an illustration of this effect with
simulations). Another consequence of this approach is that the
recombination rate parameter cannot be reliably estimated
[15]. The MSMC was used to infer the recent history of human
populations. In particular, the authors introduced the possibility to
label individuals and look at the cross-coalescence rate between
groups, a way to get a fine-tuned view of population divergence
[15, 45].


5.1.2 The Demographic Inference with Composite Approximate Likelihood (diCal)

An alternative approach was introduced by Song and colleagues
[16-18]. The demographic inference with composite approximate
likelihood (diCal) approach is based on the conditional sampling
distribution, which computes the likelihood of one genome
conditioned on the observation of others. Using the so-called
composite likelihood formula, it is therefore possible to compute
the likelihood of the data for $n$ genomes as the product of the
likelihood of one genome given the $n - 1$ other ones and the
likelihood of the remaining $n - 1$ genomes:

$$
P(D_{1 \ldots n} \mid \theta) = P(D_1 \mid D_{2 \ldots n}, \theta) \times P(D_{2 \ldots n} \mid \theta),
$$

where $\theta$ is the set of model parameters and $D_{1 \ldots n}$ denotes the data
set with $n$ genomes. By further noting that

$$
P(D_{2 \ldots n} \mid \theta) = P(D_2 \mid D_{3 \ldots n}, \theta) \times P(D_{3 \ldots n} \mid \theta),
$$

the likelihood of the full data set can be computed by recursion.
The terms $P(D_i \mid D_{i+1 \ldots n}, \theta)$ form the conditional sampling
distribution (CSD). Paul et al. [16] proposed a way to compute the CSD
at the cost of introducing several additional hypotheses: (a) the
haplotypes upon which the sample is conditioned are considered
independent, that is, no coalescence events involving these haplotypes
are allowed and (b) mutations can only occur once in any lineage
(infinite sites hypothesis). The likelihood resulting from this
approximated CSD is therefore not exact. This approach was
introduced by Li and Stephens [46] and is referred to as the product of
approximate conditionals (PAC) model. Under the PAC model,
the likelihood depends on the order in which the data are
conditioned, which can be circumvented with permutation
procedures. While the CSD-based SMC does not have the same
drawbacks as the MSMC of Schiffels and Durbin [15], its computational
efficiency decreases as the number of haplotypes considered
increases and it becomes impractical for more than 10 genomes
[19]. An elegant feature of the diCal approach is that it can be
extended to more complex demographic models, including
population structure and gene flow [18, 45]. Such an extension is of
interest as the SMC approximation has been shown to be sensitive
to strong population structure [47].


5.1.3 Extending the SMC with Conditional Site Frequency Spectra (CSFS)

In order to use the large amount of data available in "1000 genomes"
projects, Terhorst et al. [19] extended the PSMC in a
different direction. Instead of modeling the genealogy of the
complete sample, the authors proposed to model the divergence of two
haplotypes (the PSMC model) as hidden states, yet considering the
full set of genomes as observed states. In this approach, the transition
probabilities of the coalescent HMM are similar to the PSMC (or to
be more precise, similar to the MSMC with two haplotypes, as the
original PSMC uses the SMC of McVean and Cardin [33] and not
the SMC' of Marjoram and Wall [48]), but the emission
probabilities are extended to account for the full site frequency
spectrum of hundreds of genomes. This conditional site frequency
spectrum (CSFS) is computed using coalescence theory, offering a
generalization of the Poisson random field (PRF) model introduced by
Sawyer and Hartl [49]. Just like the original PRF, however, the
CSFS ignores linkage of observed states; only linkage between the
two conditioned haplotypes is modeled via the SMC. Additional
data reduction steps are therefore required to ensure that the
independence condition of sampled sites is met.


5.1.4 Explicit Reconstruction of the Ancestral Recombination Graph

While the ARG contains all historical information about a sample of
genomes, genomes themselves contain very little information
regarding the underlying ARG. As a result, in most statistical
inference methods the ARG is treated as a variable that is accounted
for but not directly inferred. In the SMC models presented above, this
is taken care of by the hidden Markov methodology, which
computes a likelihood for a given sample by summing over all possible
ARGs (via the so-called forward algorithm). The Viterbi algorithm
and the posterior decoding procedure are HMM algorithms that
allow reconstructing a posteriori the most likely ARG for a sample;
such procedures are notably used for the inference of patterns of
incomplete lineage sorting along genomes [11, 12, 50, 51]. Yet the
variance in such estimation is typically very large [12].

Rasmussen et al. [20] proposed a different approach: they
developed a Bayesian sampler of ARGs conditioned on a set of
genome sequences. Similar in principle to the PAC and CSD
approaches, the authors proposed to generate the ARG of
$n$ genomes conditioned on the ARG of $n - 1$ genomes, a
procedure they refer to as threading. The generated ARGs can then be
used to infer evolutionary processes of interest. Palacios et al. [52]
developed a non-parametric method that estimates the
variation in time of the effective population size based on such
reconstructed ARGs. Rasmussen et al. further showed that while
the model used for inference is purely neutral, the a posteriori
inferred ARG contains signatures of selection, visible for instance
as a decrease of the time to the most recent common ancestor of two
samples in the data close to coding sequences. Such approaches
offer promising avenues for the development of new statistical
methods to detect genomic regions with unusual history.


5.2 The Case of Multiple Species

Fig. 13 The coalescent process along genomes of three closely related species. (a) Four archetypes of
coalescence scenarios with three species, exemplified with human, chimpanzee, and gorilla. In the first
scenario, human and chimpanzee coalesce within the human-chimpanzee common ancestor. In the three
other scenarios, all sequences coalesce within the common ancestor of all species, each with probability 1/3
depending on which two sequences coalesce first. (b) Example of genealogical changes along a piece of an
alignment. The alignment was simulated using the true coalescent process and parameters corresponding to
the human-chimpanzee-orangutan history. The blue line depicts the variation along the genome of the
human-chimpanzee divergence. The background colors depict the change in topology, red and yellow
corresponding to incomplete lineage sorting. Each change in color or break of the blue line is the result of
a recombination event


Hobolth et al. [11] developed a hidden Markov model (HMM) to
infer the ancestral recombination graph between three closely
related species. Because this model only contains one haploid
genome per species, it only allows the inference of population parameters
in the ancestral species. Dutheil et al. [12] reparametrized this
model in the context of the sequentially Markov coalescent. In
contrast to the previous approaches, only four hidden states were
considered, corresponding to four alternative scenarios of lineage
segregation (Fig. 13). In states 1 and 2, the genealogy is consistent
with the phylogeny and lineages segregate in the same order as the 
species. In states 2, 3, and 4, allele divergence predates the first 
speciation event and ancestral polymorphism persists between the 
two speciation events, leading to incomplete lineage sorting. The 
scenarios depicted by states 2, 3 and 4 are equally likely, and in the 
case of states 3 and 4, the resulting topology is inconsistent with the 
phylogenetic tree. This model therefore does not rely directly on 
divergence variation along the genome alignment but uses patterns 
of topology variation instead to compute the speciation times and 
ancestral population sizes. 
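
The link between topology frequencies, speciation times, and ancestral population size can be made concrete with a small sketch. It assumes the standard neutral coalescent with a constant ancestral population size, so that two lineages fail to coalesce between the two speciation events with probability exp(-T), where T is the interval expressed in units of 2Ne generations. It further assumes that the observable proportion of ILS corresponds to states 3 and 4 only. The function names are hypothetical and the code is not taken from any of the cited programs.

```python
import math

def coalescent_units(delta_years, ne, generation_time):
    """Time between the two speciation events, in units of 2*Ne generations."""
    return delta_years / generation_time / (2.0 * ne)

def state_probabilities(delta_years, ne, generation_time):
    """Stationary probabilities of the four genealogy states (assumed model)."""
    deep = math.exp(-coalescent_units(delta_years, ne, generation_time))
    # State 1: coalescence between the two speciation events.
    # States 2-4: coalescence in the older ancestor, each equally likely.
    return {1: 1.0 - deep, 2: deep / 3.0, 3: deep / 3.0, 4: deep / 3.0}

def ils_proportion(delta_years, ne, generation_time):
    """Proportion of genealogies whose topology conflicts with the species tree."""
    probs = state_probabilities(delta_years, ne, generation_time)
    return probs[3] + probs[4]

def ne_from_ils(ils, delta_years, generation_time):
    """Invert ils = (2/3) * exp(-T) to recover the ancestral Ne."""
    if not 0.0 < ils < 2.0 / 3.0:
        raise ValueError("ILS proportion must lie strictly between 0 and 2/3")
    generations = delta_years / generation_time
    return generations / (2.0 * math.log(2.0 / (3.0 * ils)))
```

Under these assumptions a larger ancestral population or a shorter interval between speciation events both increase the expected proportion of ILS, which is why the two parameters are estimated jointly.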

Using this approach, Hobolth et al. estimated a speciation time
between human and chimpanzee of around 4.1 My and a large
ancestral effective population size of 60,000 for the human-chimpanzee
ancestor. Dutheil et al. [12] found similar estimates with the
same data set while accounting for substitution rate variation across
sites and estimated an average recombination rate of 1.7
cM/Mb. With the sequencing of more great ape genomes, this
approach allowed the estimation of population sizes in several ape
ancestors ([27, 50, 53], reviewed in [54]). As ILS is a proxy for ancestral
effective population size, a major result of these studies is that the
distribution of ILS is not uniform along the genome. For instance,
it is reduced in the proximity of genes, a pattern that can be explained
by background selection [27, 50]. Large regions of the X chromosome
were also found to be devoid of ILS, a pattern resulting from
recurrent selective sweeps along the chromosomes [55].


6 Specific Issues Faced When Dealing with Genomic Data 


In previous sections we discussed population genetic models and
methods for parameter estimation. We now describe several challenges
encountered when analyzing whole-genome data sets, at the
intra- and interspecific levels.


6.1 Sequencing Errors and Rate Variation


Sequencing errors are a well-described source of bias in population 
genetics analyses, resulting in an excess of singletons [56]. At both 
the intra- and interspecific/populational level, such error therefore 
leads to incorrect estimates of local divergence, in particular for 
recent times. When more divergent sequences are compared, for 
instance, from distinct species, the issue becomes more complex as 
the error rate differs between and within sequences due to coverage 
variation, but also properties of the genome (base composition, 
repeated elements, etc.). Such errors result in a departure from the 
molecular clock hypothesis, thus potentially leading to biases in 
parameter estimates, such as asymmetries in genealogy frequencies 
[57, 58]. In this respect, data preprocessing becomes a crucial step 
in any genomic analysis. In many cases, methods would also benefit
from a proper modeling of such errors. Burgess and Yang
noticed that sequencing errors can be seen as a contemporary 
acceleration in external branches, resulting in an extra branch 
length [9]. Such an extra length can be easily accommodated in 
many models. It has to be noted that only a difference in error
rates between lineages results in a departure from the molecular clock,
and in such approaches, one still has to assume that at least one
sequence is error-free. In addition, as noted by the authors, assuming
a constant error rate over all genomic positions may also turn
out to be inappropriate, and better models should allow this rate to 
vary across the sequence. Such approaches still have to be explored. 
Moreover, sequencing errors are not distinguishable from lineage-specific
acceleration (or deceleration in another species). In that
respect, sequence quality scores can be a valuable source of information.
They are currently used to preprocess the data by removing
doubtful regions, but can ultimately be used in the modeling
framework.

The substitution rate also varies along the genome, which
potentially affects the reconstruction of sequence genealogy, a phenomenon
well known to phylogeneticists. In such cases the tools
developed for phylogenetic analysis can be applied at a reasonable
cost. This generally consists of assuming a prior distribution of
the site-specific rate and integrating the likelihood over all possible
rates [8, 9, 12]. Alternatively, one can also use one or more outgroup
sequences to calibrate the rate, as in [6, 7].


6.2 Diploid Data and Phasing


While sequencing of diploid individuals allows the two
alleles present at heterozygous positions to be inferred, establishing how these
alleles are combined on each homologous chromosome requires an
additional, error-prone step called phasing. Analyses based on the
comparison of individuals from distinct species do not require such
information, as the coalescence of two alleles from the same
species is expected to have happened long after the speciation
of the compared species. In such cases alleles at each heterozygous
position can be sampled randomly [13] in order to build a composite
haploid genome. The same rationale applies with respect to
the use of the human reference genome, a composite genome
obtained from multiple individuals. Conversely, inferences at the
population level typically rely on the modeling of haploid genomes
and therefore require phased data. A notable exception is the
PSMC [14], as well as its extension SMC++ [19], which, when
applied to one diploid individual, only require knowledge of
the location of heterozygous positions.


Variation and Genome Alignment

Genome data are intrinsically fragmented, firstly because of chro- 
mosomal organization, but also because of rearrangements that 
prevent molecule-to-molecule alignment from one species to 
another. A genome data set is therefore a set of distinct alignments, 
one per synteny block. Synteny information can only be extracted 
when individual genomes are available, which is typically not the 
case for most “re-sequencing” data sets. At the population level, 
however, such large-scale variation is considered negligible (but 
see, for instance, [59] for an exception), while it becomes more 
prominent when genomes from distinct species are compared. In 
such cases, a genome alignment is constructed with potential errors 
ultimately leading to the comparison of nonhomologous regions. 
So far, the only way to deal with such errors is to restrict the analysis
to regions where orthology can be unambiguously resolved, mostly
by removing short synteny blocks and regions that contain a high 
proportion of repeated elements, gaps, and duplications. 


584 Julien Y. Dutheil and Asger Hobolth 


7 Discussion 


Studying the speciation process with genome data implies new 
modeling challenges, as the basic configuration of a population 
genetics data set is drastically changed: instead of having a few 
loci sequenced in several individuals, we have an (almost) exhaus- 
tive set of loci sequenced in several individuals for multiple closely 
related species. The change involves the spatial dimension, but also 
time, as the process under study occurred much further back in 
time than the ones that are commonly studied with a “standard” 
population genetics data set. The use of the spatial signal has a 
major consequence, namely, that recombination has to be taken 
into account, even if it is not directly modeled. 

Apart from these considerations, ancestral population genomics,
like population genetics, relies heavily on the study of sequence
genealogy, its shape, but also its variation. The underlying models
build on existing intraspecies population modeling, as they only 
need to add the species divergence process, that is, a moment in 
time where two populations stop exchanging genetic material and 
evolve fully independently. The simplest isolation model assumes 
that the speciation is instantaneous, while the isolation-with-migra- 
tion model assumes that the two neo-species can still exchange 
some material, at least for a certain time after the split. Such a 
model is not different from a pure isolation model where the 
ancestral population is structured into two subpopulations: in the 
first case the speciation time is defined as the time of the split, while 
in the second case it is the time of the last genetic exchange. Recent 
work on primates [10] suggests that the speciation of human and 
chimpanzee was not instantaneous. If the average divergence of
human and chimpanzee is slightly more than 6 My (using a widely
accepted mutation rate), then the split of the two species was initiated
around 5.5 My ago, and the last genetic exchange can be dated
to around 4 My ago.

The fact that we sample a large number of positions in the
genome thus appears to have the power to counterbalance the
reduced sampling of individuals within populations, allowing
the estimation of demographic parameters in the ancestor. Nonetheless,
complexity limits are rapidly reached, when considering, for
example, three closely related species that can exchange migrants.
More complex demographic scenarios, incorporating, for instance,
variation in population sizes, will also introduce further parameters
that might not all be identifiable.

If the ancient speciation processes have left signatures in the 
contemporary genomes, we do not know yet how far back in time 
this is true. Intuitively, the signal is maximal when the variation in 
divergence due to polymorphism is large enough compared to the 
total divergence. The divergence due to polymorphism is 
proportional to the ancestral population size, while the divergence
of species is only dependent on the time when it happened. So the
further back in time we look, the larger the population
sizes need to be for the ancient polymorphism to leave a signature
in the total divergence time. In addition to this, one has to take into
consideration sequence saturation due to the large number of
substitutions that accumulated since ancient splits, and the fact that
the complexity of demographic scenarios increases with time. For
instance, when considering the evolution of a species over several
millions of generations, the probability that a bottleneck, resetting
the signal from past events, occurred at least once is not negligible.

We are in the population genomics era. Data sets are available 
that allow us to understand the evolutionary processes that are 
associated with the formation and evolution of species. Analyzing
such data sets with the current methodologies, however, poses major
challenges: (1) developing appropriate computational tools able
to handle such data sets with current machines (both in terms of
processor speed and memory usage) and (2) designing realistic models
with enough complexity to capture the most important historical
events while remaining computationally tractable.


Assuming that there are 5 My between the speciation times of
human with gorilla and with orangutan, and that the HG ancestral
effective population size was 50,000, what is the expected amount
of ILS between human, gorilla, and orangutan? Assuming that
another 2.5 My separates the speciations of human with chimpanzee
and with gorilla, with an HC effective ancestral population size of
50,000, what is the expected amount of ILS between human,
chimpanzee, and orangutan? We assume a generation time of
20 years for all extant and ancestral primates.


Given that 30% of incomplete lineage sorting is observed between
human, chimpanzee, and gorilla and assuming a generation time of
20 years and that 2.5 My separate the splits between human/chimpanzee
and human-chimpanzee/gorilla, what is the effective
ancestral population size compatible with this observed amount?
Using Burgess and Yang’s method [9], a researcher finds a higher 
estimate of Ne than expected. What could explain this discrepancy? 


In this exercise we show that a k-population IM model has
2(k - 1)^2 migration rates.


1. Starting at the bottom of the k-population IM model, argue
that the number of migration rates at the level of k populations
is k(k - 1).

2. Moving up to the next level where (k - 1) populations are
present (one of them being an ancestral population; we assume
that two speciation events are never simultaneous), argue
that the new ancestral population introduces 2(k - 2) new
migration rates.

3. Moving up yet another level where (k - 2) populations are
present, argue that the new ancestral population introduces
2(k - 3) new migration rates.

4. Show that the total number of migration rates is 2(k - 1)^2.
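
The closed form can be checked numerically with a short sketch. It counts directed migration rates level by level under the assumption that every pair of populations coexisting at a given level exchanges migrants in both directions, so that a new ancestral population at a level with m populations adds 2(m - 1) rates. The function name is hypothetical, and the check is not a substitute for the argument asked for in the exercise.

```python
def count_migration_rates(k):
    """Count directed migration rates in a k-population IM model."""
    if k < 2:
        raise ValueError("at least two sampled populations are required")
    # Bottom level: one rate per ordered pair of the k sampled populations.
    total = k * (k - 1)
    # Each speciation event (going back in time) leaves one population fewer;
    # the new ancestor exchanges migrants with the m - 1 others, both ways.
    for m in range(k - 1, 1, -1):
        total += 2 * (m - 1)
    return total
```

For example, with k = 3 the bottom level contributes 6 rates and the level with two populations contributes 2, giving 8 = 2(3 - 1)^2.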
Introduction to the Analysis of Environmental Sequences:
Metagenomics with MEGAN

Abstract


Metagenomics has become a part of the standard toolkit for scientists interested in studying microbes in the 
environment. Compared to 16S rDNA sequencing, which allows coarse taxonomic profiling of samples, 
shotgun metagenomic sequencing provides a more detailed analysis of the taxonomic and functional 
content of samples. Long read technologies, such as those developed by Pacific Biosciences or Oxford Nanopore,
produce much longer stretches of informative sequence, greatly simplifying the difficult and time- 
consuming process of metagenomic assembly. MEGAN6 provides a wide range of analysis and visualization 
methods for the analysis of short and long read metagenomic data. A simple and efficient analysis pipeline 
for metagenomic analysis consists of the DIAMOND alignment tool on short reads, or the LAST alignment 
tool on long reads, followed by MEGAN. This approach performs taxonomic and functional abundance 
analysis, supports comparative analysis of large-scale experiments, and allows one to involve experimental 
metadata in the analysis. 


Key words Metagenomics, Software, MEGAN, Taxonomic analysis, Functional analysis, Long reads 


1 Introduction 


Metagenomics is the study of microbiome samples, such as
those obtained from ocean water, soil, plant matter, or feces, using
high-throughput DNA sequencing [1]. Metagenomic sequencing
allows the study of microorganisms found in environmental sam- 
ples without relying on culturing methods or prior knowledge of 
the composition of the community. With metagenomics, one can 
determine the taxonomic and functional content of samples. 
While most metagenomic projects to date have used short read 
sequencing (next-generation sequencing), there is increasing inter- 
est in using long read sequencing technologies in this area. Long 
read technologies have been considered too expensive, difficult, or 
error-prone for application in metagenomics. However, this is 
changing and computational analysis methods designed for processing
short reads now need to be modified to work well on long
reads, so as to make good use of the ability of long reads to cover
multiple genes.

A major computational challenge in metagenomics is the align- 
ment of sequencing reads against a comprehensive reference data- 
base. Billions of reads can be aligned against a large protein 
reference database in reasonable time using high-throughput align- 
ment tools such as DIAMOND [2]. Long reads require frame-shift 
aware alignment tools, such as LAST [3, 4], because insertions or 
deletions due to sequencing errors impact long reads, as discussed 
in Subheading 2. 

In the following, we will first discuss how to perform basic 
alignment and analysis of short reads in Subheading 2.1 and long 
reads in Subheading 2.2. We will then show, in Subheading 3, how
to compare large numbers of samples in MEGAN6 [5] and perform
basic statistical analysis of the samples and their metadata. In
Subheading 4 we briefly discuss the challenges we will have to face to
further improve the analysis of data from environmental samples.
Finally, in Subheading 4.1 we describe some additional resources
available for using MEGAN6.


2 Workflows for Metagenomic Analysis with MEGAN 


The basic workflow for using MEGAN consists of two main steps:
read alignment against a reference database, followed by import and
analysis of the alignments in MEGAN. The aim of the pipeline is to
perform taxonomic and functional binning of the input reads.

The alignment can be performed using a number of different
tools depending on the type of sequencing data and on the chosen
database, its sequence type, size, and available computer power. For
smaller databases more sensitive tools can be chosen such as MALT
[6] or even BLAST [7]. These tools generally offer higher sensitivity
at the cost of a longer runtime. For large datasets and databases,
it is more suitable to choose an alignment tool such as DIAMOND
or LAST. We use the NCBI NR database [8] with both of the latter
tools, because it is the largest and most comprehensive protein
database available today. As of August 2017, NCBI NR contained
144.5 million protein sequences.


2.1 Short Read Pipeline

We describe here the basic short read analysis pipeline as shown in 
Fig. 1. By default, we use DIAMOND to align reads against the full 
NCBI NR database. 

Before running the pipeline, one can optionally perform 
preprocessing, that is, quality control, trimming, and filtering, of 
the raw reads. However, these steps usually have little impact on 
the results of the alignment-based analysis described in this 
document. 
Fig. 1 Basic pipeline for short read analysis


2.1.1 Read Alignment 
with DIAMOND 


2.1.2 Taxonomic and 
Functional Classification 
with MEGAN6 


DIAMOND uses double indexed alignment, which means both the 
reference database and the query are indexed for comparison. This 
leads to a large speedup especially for large queries and databases. 
Like BLASTX, DIAMOND uses the “seed and extend” method to 
find all matches between a query and the database. To further 
increase speed, DIAMOND utilizes spaced seeds, which are long 
seeds where only some positions are used for matching the seed. 
This leads to another increase of speed without decreasing 
sensitivity. 
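
The idea behind spaced seeds can be sketched in a few lines. The seed shape below is hypothetical and much shorter than the shapes used in practice, and for brevity only the reference is indexed, whereas DIAMOND indexes both the query and the reference. The sketch is meant to show the principle only and does not reproduce DIAMOND's implementation.

```python
from collections import defaultdict

# Hypothetical seed shape: '1' positions must match, '0' positions are ignored.
SEED_SHAPE = "110101"

def spaced_keys(seq, shape=SEED_SHAPE):
    """Yield (position, key) for every window of seq, keeping only '1' positions."""
    care = [i for i, flag in enumerate(shape) if flag == "1"]
    for start in range(len(seq) - len(shape) + 1):
        yield start, "".join(seq[start + i] for i in care)

def seed_hits(query, reference, shape=SEED_SHAPE):
    """Return (query_pos, reference_pos) pairs whose spaced seeds agree."""
    index = defaultdict(list)
    for pos, key in spaced_keys(reference, shape):
        index[key].append(pos)
    return [
        (q_pos, r_pos)
        for q_pos, key in spaced_keys(query, shape)
        for r_pos in index.get(key, [])
    ]
```

Because the ignored positions tolerate substitutions, a spaced seed can still hit where a contiguous seed of the same span would miss; each hit would then be passed on to the extension phase.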

DIAMOND can be run in either fast or sensitive mode. Fast
mode runs around 20,000 times faster than BLASTX on short
reads and finds 75-90% of all relevant matches that
BLASTX would find, whereas sensitive mode provides a
speedup of 2500x and recovers up to 94% of significant
matches.


DIAMOND can save alignments in a compressed format called
DAA (DIAMOND alignment archive). DAA files can be
imported into MEGAN6 in multiple ways. A small number of small
DAA files can easily be imported interactively using menu items
provided in MEGAN. For larger datasets and/or many files, one
should use the command-line tools provided with MEGAN. These
include daa2rma, which will generate an RMA file as used by
MEGAN from one or two (for paired reads) DIAMOND files,
and daa-meganizer, which analyzes a DAA file and then appends
the result to the end of the file. Such “meganized” DAA files can
then be opened directly in MEGAN. The latter approach is much
faster and is more space efficient. However, to use paired reads all
alignments have to be in the same file.

One can use the program blast2rma to process the output of 
a range of different alignment programs, such as BLAST. 

During the processing of alignments for MEGAN, the reads 
will be assigned to nodes in the NCBI taxonomy and any functional 
classifications that have been configured in the import dialog or on 
the command line. Taxonomic binning of each read is done separately,
by assigning it to the lowest common ancestor (LCA) of its
significant matches. Matches can be filtered by multiple parameters, 
for example, e-value and bit-score, as well as sequence identity. 
Only matches passing those filters will be used to determine the 
LCA. It is also important to choose the minimum support 
(or minimum support percentage), the number or percentage of 
reads that must be assigned to a single taxon before it will be part of 
the final result. Reads assigned to a taxon that does not pass the 
minimum support filter will be pushed up the taxonomy until a 
taxon is found that passes the filter. 
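
The binning procedure described above can be sketched as follows. The taxonomy is represented by a hypothetical child-to-parent dictionary, the matches of each read are assumed to have been filtered already, and the minimum support rule is applied to the reads assigned directly to a taxon plus those pushed up to it. This is a simplified illustration and not MEGAN's implementation.

```python
from collections import Counter

def lineage(taxon, parent):
    """Path from a taxon up to the root, following child -> parent links."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path

def lca(taxa, parent):
    """Lowest common ancestor of the taxa hit by one read."""
    paths = [lineage(t, parent) for t in taxa]
    shared = set(paths[0]).intersection(*paths[1:])
    # Walking upward, the first shared node is the lowest common ancestor.
    return next(node for node in paths[0] if node in shared)

def bin_reads(read_hits, parent, min_support=1):
    """Assign reads by LCA, then push under-supported taxa up the taxonomy."""
    counts = Counter(lca(taxa, parent) for taxa in read_hits.values() if taxa)

    def depth(taxon):
        return len(lineage(taxon, parent))

    pending = sorted(counts, key=depth, reverse=True)  # deepest taxa first
    while pending:
        taxon = pending.pop(0)
        if counts[taxon] >= min_support or taxon not in parent:
            continue
        up = parent[taxon]
        counts[up] += counts.pop(taxon)
        if up not in pending:
            pending.append(up)
            pending.sort(key=depth, reverse=True)
    return dict(counts)
```

A read matching two species of the same genus is thus placed on the genus, and a taxon supported by too few reads hands them on to its parent.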

Functional binning is performed by mapping the NCBI data- 
base accessions for the matches of a read to identifiers of the 
selected functional classification. Mapping files are currently avail- 
able for InterPro2GO [9, 10] (InterPro families embedded in a 
GO-based hierarchy), eggNOG [11], KEGG [12], and 
SEED [13]. 


2.1.3 Investigation of the Results

The resulting files can be opened and interactively investigated
using the MEGAN6 graphical user interface. The first view when
opening a file is always a hierarchical representation of the taxonomic
composition of the sample. Selecting different nodes of this
tree, the user can uncover further information on the reads mapped
to the represented taxon. Selecting Inspect Reads on a node will
open the Inspector Window, which displays the reads assigned to
that node, as well as their alignments. This functionality can be used
both in the Taxonomy Viewer, where nodes represent taxa, and in
any of the Functional Viewers. Figure 2a shows an example of the
Inspector Window.

Instead of just viewing a listing of the matches and alignments,
it is also possible to select Show Alignments. This will open the
Alignment Viewer (Fig. 2b), where for each of the database references
with matches from the reads assigned to the selected node it is
possible to show the alignment of all of those reads on the reference.
This can be useful, say, to determine how much of a reference
gene is covered by reads.

Fig. 2 (a) The Inspector Viewer showing some reads that have been assigned to Alistipes ihumii. (b) The
Alignment Viewer showing reads aligned to a reference sequence

Fig. 3 (a) Bar chart of taxonomic assignments on family level, sorted by abundance. (b) Radial chart of
functional assignments to KEGG for the same sample from [14]

Apart from investigating taxonomic diversity, an
advantage of using metagenomic sequencing to study an environmental
sample is the ability to study the functional potential of the
community. MEGAN currently provides four different functional 
classification systems for this purpose: InterPro & GO, eggNOG, 
KEGG, and SEED. 

Each functional classification is displayed as a tree. The nodes of 
the tree can be investigated very much like the nodes of the taxonomic
tree. Abundances can be visualized using different visualization
options, ranging from simple bar charts, box plots, and heat maps to
radial tree charts drawn based on the abundances of the selected
nodes. Two example charts are shown in Fig. 3.

Alignments or reads matching a selected function can be 
exported to a text file or extracted to a new MEGAN document. 
This makes it possible to study only a part of a microbial commu- 
nity that is of particular interest. For example, if you select nodes 
associated with antibiotic resistance genes, you can determine 
which taxonomic assignment the reads assigned to antibiotic resis- 
tance genes have. An example of this is shown in Fig. 4. 

If you want to study the full gene sequence of proteins found in 
your samples and be able to compare variants of those genes, it can 
be helpful to use gene-centric assembly [15]. Gene-centric assem- 
bly uses the alignments to reference proteins to assemble the 
matching reads. One can thus obtain the gene sequences from 
different organisms found in a sample for further analysis steps. 

We will introduce more possibilities for studying the taxonomic 
and functional diversity of multiple samples in comparison in 
Subheading 3. 
Fig. 4 Taxonomic assignment of reads from the day 0 sample for “Alice” from the ASARI [14] dataset which
have been assigned to “resistance of fluoroquinolones” in the SEED hierarchy 


2.2 Long Read Pipeline

As presented in the previous section, using metagenomic short reads,
one can assemble gene sequences and obtain variants of a single gene
using a gene-centric assembly, or of course use other assembly techniques.
However, using short read data, it is very difficult to establish
whether different genes are present in the same organism. We can
connect the genes if they are found on a single DNA molecule with
long sequencing reads, provided by third generation sequencing
technologies such as PacBio [16] or Oxford Nanopore [17].
The PacBio and Nanopore devices can produce reads that are
hundreds of thousands of bases long, with error rates of around
10% [17]. In contrast to short reads, each of which can be safely
assumed to overlap with only a single gene, long reads will usually
overlap or contain multiple genes. Hence, many popular short read
alignment and analysis algorithms may require modification so as to
take into account that a given read can align to multiple genes.


2.2.1 Long Read Analysis Pipeline

The basic long read analysis pipeline is analogous to the short read
pipeline described above, and consists of the alignment and
MEGAN analysis steps (Fig. 5), but the details of the analysis
pipeline as well as some components of MEGAN6 differ from the
short read solution.

As described in the following, for long reads, alignment is
performed using LAST, processing of the alignments requires an
additional step, and MEGAN provides some modified algorithms
for processing and visualizing long reads.
Fig. 5 Basic pipeline for long read analysis

Fig. 6 A frame-shift aware DNA-to-protein alignment produced by LAST


2.2.2 Alignment Using 
LAST 


2.2.3 Taxonomic and 
Functional Classification of 
Long Reads 


Third generation sequencing technologies produce much longer 
reads, with a higher error rate (approximately 10%, mostly inser- 
tions and deletions). Most DNA-to-protein aligners (such as 
BLASTX [7] or DIAMOND) translate the complete DNA query 
sequence in all six reading frames and then align the translated 
sequences against the protein database. Insertions or deletions in 
long reads cause a frame-shift and break translation-based align- 
ments. LAST is a frame-shift aware aligner that incorporates single- 
base insertions or deletions into the alignment calculation. These 
are represented as “\” for forward-shifts and “/” for reverse-shifts,
as shown in Fig. 6. 
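
The effect of a single-base deletion on a translated alignment can be illustrated with a small sketch. It uses the standard genetic code and translates forward frames only; the toy sequence and the function name are made up for this example, and the code does not reproduce how LAST handles frame-shifts internally.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna, frame=0):
    """Translate one forward reading frame (0, 1, or 2) of a DNA string."""
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3)
    )

gene = "ATGGCTAAAGGTCTGTGG"        # toy coding sequence, translates to MAKGLW
broken = gene[:3] + gene[4:]       # simulate a sequencing error: one base deleted

frame0 = translate(broken, 0)      # frame 0 is garbled after the deletion
frame2 = translate(broken, 2)      # the remaining protein continues in frame 2
```

After the deletion the downstream residues are only recovered in a different frame, so an aligner that keeps to one frame per alignment would have to stop at the error, whereas a frame-shift aware aligner can continue across it.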

LAST, when used with large databases, such as NCBI-nr, splits 
the database into several volumes and indexes them individually. 
Similarly the large input files are loaded in separate volumes, and 
each volume of input is searched against each volume of the data- 
base. LAST, by default, generates output in MAF, “Multiple Align- 
ment Format.” 


Because both the query and the database are processed in separate
volumes and the output is written as soon as it is generated, the
alignments for a single read appear in different parts of the MAF
output of LAST. MEGAN processes alignment files line-by-line,
identifies all alignments of a single read, and then assigns that read 
to a taxonomic and/or functional class. The unordered structure of 
LAST output prevents MEGAN from doing this. Thus, MAF files 
produced by LAST must be sorted before they are imported to 
MEGAN. For this task, MEGAN provides a command-line script, 
called sort-last-maf: 
2.2.4 Investigation of the Results


Alternatively, the user can use DAA_Converter (available at 
http://github.com/BenjaminAlbrecht84/DAA_Converter), which 
converts a given MAF file to a DAA file. This has several advantages, 
including space compression and faster processing. Additionally, 
the output of LAST can directly be piped into DAA_Converter,
which will then convert the output into a DAA file as LAST continues
to operate. The current trade-off when using DAA_Converter
is that the alignments are filtered using the default settings
of MEGAN6 and the resulting DAA file only contains the alignments that
pass the filter, making it impossible to change the filtering
parameters without running LAST again once the conversion
is done.

Similar to short reads, these long read MAF and DAA files can then
be imported into MEGAN and each read will get assigned to a
taxon and/or functional class(es) of any provided functional hierarchy.
The filtering of alignments based on bit-score works differently
for long reads. In the case of short reads, the alignments are
filtered globally: only those whose score is within the top 10% (by default)
of the best-scoring alignment are taken into account. For long
reads, this filter is applied to each “gene” separately, as one
long read can contain many different genes along its length. The
alignments that overlap significantly (>90% by default) are grouped
into segments, denoting different genes, and each segment is then
processed individually in the filtering step.
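
A minimal sketch of this segment-wise filtering is given below. Several details are assumptions made for illustration: the overlap between two alignments is measured relative to the shorter one, segments are formed greedily against the first alignment of each segment, and all alignments are assumed to have a positive length. MEGAN's actual grouping procedure may differ.

```python
from typing import NamedTuple

class Alignment(NamedTuple):
    start: int        # start on the read (0-based, inclusive)
    end: int          # end on the read (exclusive)
    bit_score: float
    ref: str          # name of the matched reference protein

def overlap_fraction(a, b):
    """Shared length divided by the length of the shorter alignment."""
    shared = min(a.end, b.end) - max(a.start, b.start)
    shorter = min(a.end - a.start, b.end - b.start)
    return max(shared, 0) / shorter

def group_into_segments(alignments, min_overlap=0.9):
    """Greedily group alignments that overlap a segment's first alignment."""
    segments = []
    for aln in sorted(alignments, key=lambda a: (a.start, a.end)):
        for segment in segments:
            if overlap_fraction(segment[0], aln) > min_overlap:
                segment.append(aln)
                break
        else:
            segments.append([aln])
    return segments

def filter_top_percent(alignments, top_percent=10.0):
    """Keep alignments scoring within top_percent of the best bit-score."""
    cutoff = max(a.bit_score for a in alignments) * (1.0 - top_percent / 100.0)
    return [a for a in alignments if a.bit_score >= cutoff]

def filter_long_read(alignments, top_percent=10.0, min_overlap=0.9):
    """Apply the top-percent filter to each segment of a long read separately."""
    return [
        filter_top_percent(segment, top_percent)
        for segment in group_into_segments(alignments, min_overlap)
    ]
```

With a global filter, a weakly scoring gene on the same read as a strongly scoring gene would lose all of its alignments; filtering per segment keeps the best alignments of every gene.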

The LCA algorithm used to assign reads to taxonomic classes is also
modified for long reads. As there are multiple genes on a single long
read, and each of them may be conserved in different clades of the
taxonomic tree, the naive LCA is usually uninformative. Instead,
long reads are assigned to the most specific taxon that covers more
than a fixed percentage (80% by default) of all base pairs that have
an alignment. This algorithm assigns a read to a specific, lower
level of the taxonomy as long as the read covers a gene with a low
level of conservation, because other taxa then receive lower
percentages of coverage. Functional classification of long reads does
not necessarily assign each read to a single functional class.
Instead, each segment is assigned to the functional class of its
best-scoring alignment, so one read can be assigned to multiple
different functional classes.
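
The coverage-based assignment can be sketched as follows. This is a simplified reading of the description above, not MEGAN's code: the taxonomy is a plain child-to-parent dictionary, alignments are hypothetical `(start, end, taxon)` tuples with half-open intervals, and ties between equally deep taxa are broken by coverage, which is an assumption.

```python
def assign_long_read(alignments, parent, min_cover=0.8):
    """Return the deepest taxon covering more than min_cover of all aligned bases.

    alignments: iterable of (start, end, taxon), half-open intervals on the read.
    parent: dict mapping each taxon to its parent; the root maps to None.
    """
    covered = {}     # taxon -> read positions covered by it or by its descendants
    aligned = set()  # every read position that has at least one alignment
    for start, end, taxon in alignments:
        positions = set(range(start, end))
        aligned |= positions
        node = taxon
        while node is not None:  # credit the taxon and all of its ancestors
            covered.setdefault(node, set()).update(positions)
            node = parent.get(node)
    if not aligned:
        return None

    def depth(taxon):
        steps = 0
        while parent.get(taxon) is not None:
            taxon = parent[taxon]
            steps += 1
        return steps

    candidates = [t for t, pos in covered.items()
                  if len(pos) > min_cover * len(aligned)]
    if not candidates:
        return None
    # Most specific taxon first; ties broken by coverage (assumption).
    return max(candidates, key=lambda t: (depth(t), len(covered[t])))
```

With a read whose aligned bases are 90% covered by one species, the species is returned; if two sibling species each cover half of the read, the assignment moves up to their common parent.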


The first view the user gets when a long read dataset is loaded into
MEGAN6 is identical to that of a short read dataset; however, there 
are some underlying differences and several investigation modes 
designed specifically for long reads. 

Due to the large variability in the length of long reads [18], it is
impractical for MEGAN to report the number of reads assigned to a
class as a measure of abundance. Using the raw read length is also
not feasible for Nanopore technology, as reads tend to have “head”
and “tail” regions composed of random bases [19] (Fig. 7 shows a
read whose tail region has no significant alignment to any protein in
the database). Thus, the default measure of abundance for a
particular taxon or functional class in the long read pipeline is the
number of aligned bases.

Fig. 7 Long Read Inspector in MEGAN6. The read is drawn as a line in the middle and the protein alignments
are drawn as arrows on their corresponding positions and strands on the read
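
As a rough illustration of counting aligned bases rather than reads, the sketch below merges overlapping alignment intervals on each read and sums the covered positions per assigned class. The input layout (a class name plus a list of half-open `(start, end)` intervals per read) is hypothetical and chosen only for this example.

```python
from collections import defaultdict


def count_aligned_bases(intervals):
    """Number of read positions covered by at least one half-open interval."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)  # extend the merged interval
    if current_end is not None:
        total += current_end - current_start
    return total


def abundance_by_aligned_bases(reads):
    """reads: iterable of (assigned_class, [(start, end), ...]) pairs."""
    abundance = defaultdict(int)
    for assigned_class, intervals in reads:
        abundance[assigned_class] += count_aligned_bases(intervals)
    return dict(abundance)
```

Unaligned head and tail regions contribute nothing to the count, so a long read with few aligned positions does not inflate the abundance of its class.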

The number of alignments on a long read can easily run into the
hundreds, which complicates the use of the Alignment Viewer and the
Inspector features of MEGAN6. To simplify the investigation of
alignments on the reads, MEGAN6 offers a Long Read Inspector
window (Fig. 7), accessible via right-click on any of the nodes in
the main view. This inspector draws reads as horizontal lines and 
alignments as arrows on their corresponding positions. The names 
of taxa or functional classes are also linked to these alignment 
arrows. 

The Inspector Window is particularly helpful in the case of
suspicious assignments. Figure 8a shows the inspector view for a read
that was assigned to Trichuris trichiura, a human parasitic
whipworm, in a sample containing a known mixture of microorganisms
[20]. A closer inspection of Fig. 8a shows that, although the read is
spanned by several alignments to Escherichia coli, it is assigned to
T. trichiura because the alignments to T. trichiura cover more than
80% of the aligned bases, whereas the coverage is below that
threshold for E. coli and all other competing taxa.

For further analysis of such suspicious assignments, MEGAN6 
offers a remote BLAST function, in which selected reads are aligned 
against a selected database (such as the nucleotide collection— 
NCBI nt) on the NCBI website and the resulting assignments are 
captured, processed, and presented in a new MEGAN document. 
In Fig. 8b, we see that our “suspicious” read is assigned to E. coli,
which was in the known mixture of microorganisms, based on 
remote NCBI-BLAST against NCBI nt. 

Similar to exporting alignments and reads as explained in the
previous section, these can also be exported in general feature
format (GFF) for downstream analysis. This provides a simple way
of obtaining the annotation, especially for long reads and contigs.
The annotations exported to the GFF files contain the accessions of
references and their corresponding taxonomic and/or functional
mappings, depending on which mapping files were used when
importing the dataset into MEGAN.

Fig. 8 MEGAN6 offers a remote BLAST functionality, namely “BLAST on NCBI,” which can be used for
suspicious assignments. (a) Long Read Inspector view for a read assigned to Trichuris trichiura, based on
protein alignments against NCBI nr. (b) Long Read Inspector view for the same read as in (a), assigned to
Escherichia coli, after searching it against nucleotide collection of NCBI using the remote BLAST functionality
of MEGAN6


3 Comparison of Multiple Samples 


Most modern metagenomics experiments include the collection 
and analysis of multiple samples to compare different groups with 
controls or study the dynamic changes of a microbial community 
over time. Hence, a very important feature of MEGAN is the ability 
to load multiple datasets into a single “comparison document” 
(megan file). This is a light-weight file that does not contain the 
original reads and alignments, but allows one to compare the 
taxonomic and functional diversity of multiple samples. 

To be able to easily compare groups of samples and relate 
findings to features attached to samples, it is helpful to import 
metadata. Metadata should be provided in tabular format and 
connect the sample IDs to attributes whose values can be text, 
numeric, or boolean values. Using this information you can 
group samples in different visualizations. For example, this allows
easier interpretation of the principal coordinates analysis (PCoA)
plots in MEGAN. Principal coordinates can be calculated using
different distance measures, including Bray-Curtis or simple
Euclidean distances. MEGAN can include bi-plot and tri-plot vectors
in the PCoA plot, which represent the top taxonomic or functional
classes and metadata features, respectively, that correlate
most with the differences between samples. Figure 9 shows multiple
examples of PCoA plots including bi-plot and tri-plot vectors.

MEGAN can also calculate and visualize co-occurrence and
correlation plots. For correlation there are two options. The first
is useful for time series analysis, because it calculates correlations
between different taxa. This can be used to determine how changes
in the abundance of one taxon influence changes in another, which
makes it possible to detect potential interactions between taxa. To
distinguish interactions between taxa from the effect of an external
influence, it is useful to consult the second option, the attribute
correlation plot, which calculates correlations between taxa and
metadata. If, for example, two taxa are correlated with each other
and with the same external influence from the metadata, then they
might be less likely to be influencing each other and are perhaps
both influenced by the same attribute of the metadata. An example of
an attribute correlation plot is shown in Fig. 10.
Fig. 9 PCoA analysis of 12 samples associated with “Alice” (round shapes) and “Bob” (square shapes), from
[14]. Time points of antibiotic intake are colored light blue, time points before and after antibiotic intake dark
red. (a) A PCoA plot based on Bray-Curtis distances as calculated by MEGAN using the taxonomic abundances
for the samples. The green vectors represent the bi-plot vectors. The samples are grouped by individual,
showing the convex hulls of the groups as well as ellipses. (b) is based on the same data but using the
abundances of GO terms in the InterPro2GO hierarchy and only showing the convex hulls of the group. Here the
orange vectors are the tri-plot vectors, showing the relation of metadata values to the principal components

Fig. 10 Attribute correlation plot for the data from [14] for two healthy individuals taking antibiotics for 6 days
(day 1-6). Correlation is shown as a heat map with red marking positive correlation between the attribute and
the taxon and blue marking negative correlation. Correlations are shown for antibiotics intake (boolean) and
time (day 0, 1, 3, 6, 8, and 34)
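
The distance measures named for the PCoA plots, and the idea behind an attribute correlation, can be written down in a few lines of plain Python. This is a minimal sketch under stated assumptions: profiles are lists of counts in the same class order, and Pearson correlation is used only as an example, since the correlation coefficient used by MEGAN is not specified here. The ordination step itself (eigendecomposition of the distance matrix) is omitted.

```python
import math


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def euclidean(a, b):
    """Simple Euclidean distance between two abundance profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(profiles, distance=bray_curtis):
    """profiles: dict sample name -> list of counts. Returns (names, matrix)."""
    names = sorted(profiles)
    matrix = [[distance(profiles[r], profiles[c]) for c in names] for r in names]
    return names, matrix


def pearson(x, y):
    """Pearson correlation, e.g. taxon abundance against a metadata attribute."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / (sd_x * sd_y)
```

A boolean attribute such as antibiotic intake can be encoded as 0/1 before calling `pearson`; a negative value then indicates a taxon that is less abundant while the attribute is true.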


4 Outlook 


It goes without saying that the quality and quantity of the input 
sequencing data limit the reliability of the output analysis. More
directly, quality of the MEGAN hierarchy assignments is deter- 
mined by the quality of the read alignment, which, in turn, depends 
on the chosen database and alignment tool. On the one hand, the 
database needs to be well annotated and comprehensive, as it is only 
possible to analyze the organisms or entities present in it. On the 
other hand, the alignment tool needs to be sensitive in order to 
identify the matching sequence. It is especially difficult to deal with 
sets of very similar sequences. Currently, for the human gut micro- 
biome sequencing data analyzed with the basic short read pipeline, 
as much as 30% of reads are not assigned to any node in the course 
of the taxonomic analysis. 

In order to avoid the bias introduced by the database one can 
also use one of the database-free strategies, e.g., k-mer counting. 
They are good for tracking the global changes in the data, but it is 
difficult to correct for possible contaminations. Although MEGAN 
does not support this type of analysis, it enables global comparisons 
with PCoA based on the profiles computed for each of the samples. 
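
To make the database-free idea concrete, the following sketch counts k-mers in a set of reads and turns the counts into a relative-frequency profile that could be compared between samples. The function names, the default k, and the decision to skip k-mers containing ambiguous bases are illustrative assumptions, not part of MEGAN.

```python
from collections import Counter


def kmer_counts(reads, k=4):
    """Count all k-mers over the reads, skipping k-mers with non-ACGT characters."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def kmer_profile(reads, k=4):
    """Relative k-mer frequencies, so samples of different depth are comparable."""
    counts = kmer_counts(reads, k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}
```

Such profiles need no reference database, which is also why they cannot tell contaminant reads from genuine ones.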
Another approach is assembly-based analysis. In brief, the reads
are assembled and then the scaffolds or contigs are annotated and
investigated. This approach provides some information on gene
co-localization, at the cost of data loss in the form of unassembled
reads and short contigs. Full metagenomic read assembly [21] is a
very complex and computationally expensive task that MEGAN
does not address.

The application of long read sequencing technologies opens
new perspectives for metagenomics analysis. Long reads provide
information on gene co-location on a single DNA molecule and
make assembly much easier. However, long reads also pose new
algorithmic challenges in protein alignment, hierarchy assignment,
and abundance computation. As long read technologies
continue to evolve, so, too, must the corresponding analysis
algorithms.

MEGAN is a powerful visual analytics tool that provides a wide
range of algorithms for the analysis of metagenomics sequencing
data. MEGAN can run on hundreds of samples along with
hundreds of metadata columns. It is the main workhorse of the
Tübiom project, in which metagenomics profiles of 10,000 volunteers
are collected and mined for correlations with the vast metadata
(www.tuebiom.de).


4.1 MEGAN Resources


MEGAN Community software is freely available on the website
ab.inf.uni-tuebingen.de/data/software/megan6, together with the
current mapping files for taxonomic and functional analysis.

Short read datasets presented in this chapter and used for
visualizations are publicly accessible in MEGAN via MeganServer.
The dataset used in the Long Read Pipeline section was
downloaded from the supplementary material of Brown et al.
[20]. Instructions for use of MEGAN and user support can be
found on the MEGAN community website
(megan.informatik.uni-tuebingen.de).

Abstract


Metagenomics, also known as environmental genomics, is the study of the genomic content of a sample of 
organisms (microbes) obtained from a common habitat. Metagenomics and other “omics” disciplines have 
captured the attention of researchers for several decades. The effect of microbes in our body is a relevant 
concern for health studies. There are plenty of studies using metagenomics which examine microorganisms 
that inhabit niches in the human body, sometimes causing disease, and are often correlated with multiple 
treatment conditions. No matter from which environment it comes, the analyses are often aimed at 
determining either the presence or absence of specific species of interest in a given metagenome or 
comparing the biological diversity and the functional activity of a wider range of microorganisms within 
their communities. The importance increases for comparison within different environments such as 
multiple patients with different conditions, multiple drugs, and multiple time points of the same treatment
or the same patient. Thus, no matter how many hypotheses we have, we need a good understanding of
genomics, bioinformatics, and statistics to work together to analyze and interpret these datasets in a 
meaningful way. This chapter provides an overview of different data analyses and statistical approaches 
(with example scenarios) to analyze metagenomics samples from different medical projects or clinical trials. 
1 Introduction


The diversity of species on earth is high, and most of them are
microorganisms. Their ubiquitous presence makes it extremely
difficult to identify and classify all microbes in a laboratory
environment. Standard genomics tries to enrich pure cultures and study
them: for example, the taxonomy, the genome, the genes, and the
pathways. However, only a minuscule fraction of all microbes can be
cultured, because of their complex symbioses with, and nutrient
requirements from, other organisms. The scientific community is now
equipped with new sequencing techniques and high-throughput
analysis. The study of the genomic content of a sample of
microorganisms obtained from a common habitat is made possible by
the field of metagenomics, also known as environmental genomics [1].
Instead of taking the DNA for sequencing from isolated cultures, it
is obtained directly from the environment. Therefore, the analysis of
microbes that are deemed unculturable with standard laboratory
techniques (meaning that current laboratory culturing techniques are
unable to grow them) becomes possible.

Two main approaches are commonly used in metagenomic studies:
marker gene-based metagenomics (e.g., 16S amplicon sequencing)
and metagenomic shotgun sequencing. In the first approach, DNA is
used as the template for PCR to amplify a segment of the conserved
16S ribosomal RNA (rRNA) gene sequence. Universal primers
complementary to conserved regions are used so that the region
can be amplified from any bacterium. After purification of the PCR
products, sequencing of the 16S rRNA gene is performed [2]. In
the second approach, shotgun sequencing, DNA is broken up
randomly into multiple small segments, which are sequenced
using the chain termination method to obtain reads. Multiple
overlapping reads for the target DNA are obtained by performing
several rounds of this fragmentation and sequencing. Computer
programs then use the overlapping ends of different reads to
assemble them into a continuous sequence [3]. There are several
publications discussing the differences in microbial biodiversity
discovery between 16S amplicon and shotgun sequencing; for example,
see [4]. In a recent study using water samples from Brazil’s major
river floodplain systems, the authors showed that shotgun sequencing
was outperformed by amplicon sequencing [5]. Here, the authors
ascribed the poor performance of shotgun sequencing mainly to the
weakness of the database used in the study, as compared to databases
for the 16S rRNA gene. This study can serve as a caution for people
working with rare environments (see the article by Catherine Offord
in The Scientist¹). Comparisons of the two methods in well-studied
systems such as the gut microbiome have generally found that shotgun
sequencing identifies more microbial diversity [6].

Furthermore, recent advances in culturomics approaches are
shedding light on multiple high-throughput culture conditions
[7, 8]. As the samples used in metagenomics do not contain the
genome of just one but of many different microorganisms, the
possibility of analyzing their functional and metabolic interplay
arises. Next-generation sequencing (NGS) technology has effectively
transformed infectious disease research throughout the last decade,
fuelling the growth in genetic data and providing a huge number of
DNA reads at an affordable cost. Many studies use these
techniques to examine microorganisms that inhabit niches in
the human body, sometimes causing disease, and researchers often
try to correlate these microorganisms and their changes with
multiple treatment conditions (e.g., see [9]). Gene annotations in
these studies support the association of specific genes or metabolic
pathways with health and with specific diseases. In a recent article,
the authors discussed how host gene-microbial interactions are major
determinants for the development of multifactorial chronic
disorders and thus for the relationship between genotype and
phenotype [10]. There are many other reports based on the application
of metagenomics in understanding oral health and disease
[11-13]. As recently described by Forbes et al., metagenomics
and other “omics” disciplines could provide the solution to a
cultureless future in clinical microbiology, food safety, and public
health [14].

No matter from which environment they come, the analyses of
datasets from such studies are similar to some extent. Most projects
aim either at determining the presence or absence of specific species
of interest, or at obtaining an overview of the taxa represented in a
given metagenome and comparing the biological diversity and the
functional activity of a wider range of microorganisms within their
communities. The importance increases for the comparison of different
datasets, as researchers will need to determine and understand the
similarities and dissimilarities between the metagenomes of different
environments. These environments can be multiple patients with
different conditions, multiple drugs, or multiple time points of
the same treatment or the same patient. Further, researchers may
sometimes also compare different environments, for example, to study
antibiotic resistance genes (ARGs) and understand which
environments are more prone to such ARGs. Thus, no matter how many
hypotheses we have, we need a good understanding of genomics,
bioinformatics, and statistics to work together to analyze and
interpret these datasets in a meaningful way.

This chapter provides an overview of different data analyses and 
statistical approaches to analyze metagenomics samples from a 
number of clinically derived datasets. The methodological descrip- 
tion of this chapter will be guided by three main scenarios. The first 
one is a published data set from human atherosclerotic plaque 
samples (Scenario 1) [15]; the second one is a clinical trial example 
comparing the effects of two omega-3 polyunsaturated fatty acids 
(PUFAs) supplements on healthy volunteers (Scenario 2) [16]; and 
the third one is another clinical trial example comparing the efficacy 
of two drugs for an infectious disease (Scenario 3). 

Scenario 3 comes from an ongoing unpublished project;
therefore, the real datasets are not provided. This chapter is mainly
focused on the data analysis, annotation, and statistical
approaches that can be used in similar situations; the biological
findings of the example scenarios are not explained here. Although all


608 Suparna Mitra 


of these scenarios are derived from medical projects, the analysis
approach can be adapted to environmental samples as well. On this
occasion, I must emphasize the importance of having good metadata,
that is, a detailed description of each parameter, such as health
status, sampling site, age, or any similar information relating to
specific samples that may be important for the analyses. Good
metadata are key to good analyses and to noise reduction in data
analysis processes.


2 Description of Example Studies 


2.1 Scenario 1: Metagenomic Analyses of Human Atherosclerotic Plaque Samples


To investigate microbiome diversity within human atherosclerotic
tissue samples, high-throughput metagenomic analysis was
employed on (1) atherosclerotic plaques obtained from a group of
patients who underwent endarterectomy due to recent transient
cerebral ischemia or stroke and (2) presumed stable atherosclerotic
plaques obtained at autopsy from a control group of patients
who all died from causes not related to cardiovascular disease. Our
data provide evidence that suggests a wide range of microbial
agents in atherosclerotic plaques, and an intriguing new
observation that this microbiota displayed differences between
symptomatic and asymptomatic plaques, as judged from the
taxonomic profiles in these two groups of patients. Additionally,
functional annotations reveal significant differences in basic
metabolic and disease pathway signatures between these groups.

In this project, we demonstrate the feasibility of novel high- 
resolution techniques aimed at identification and characterization 
of microbial genomes in human atherosclerotic tissue samples. Our 
analysis suggests that distinct groups of microbial agents might play 
different roles during the development of atherosclerotic plaques. 
These findings may serve as a reference point for future studies in 
this area of research. The workflow in Fig. 1 provides a brief 
description of the sample processing and analyses pipeline for the 
study described in Scenario 1. Readers who want more details
of the methodology may refer to [15]. This scenario is an
example of analyzing host-associated metagenome samples.


2.1.1 Methodology Details


For this study, we used atherosclerotic tissue samples from a group
of 15 patients that underwent elective carotid endarterectomy
following repeated transient ischemic attacks or minor strokes
(samples from symptomatic atherosclerotic plaques as cases).²
Further, we have asymptomatic atherosclerotic plaques from seven
persons who died from causes not related to atherosclerotic disease
(samples from stable plaques as controls).³


² All methods and experimental manuals were approved by The National Committee on Health Research Ethics
(Danish), and approval was granted by the Ethical Committee of the region of Copenhagen (H-3-2011-013).


Fig. 1 Analysis pipeline for the study of human atherosclerotic plaque samples. Interested readers may refer to
the full study here [15]. Text of the workflow boxes:

Samples: 15 patients that underwent elective carotid endarterectomy following repeated transient ischemic
attacks or minor strokes (samples from symptomatic atherosclerotic plaques as cases); asymptomatic
atherosclerotic plaques from 7 persons who died from causes not related to atherosclerotic disease (samples
from stable plaques as controls).

Laboratory steps: DNA was extracted using QIAGEN’s DNeasy Blood & Tissue kit. Next-generation sequencing
library preparation followed Illumina’s TruSeq DNA Sample Preparation protocol. Quality of the DNA samples
was assessed on a Bioanalyzer 2100, using a DNA 12000 Chip (Agilent). Library quantitation was performed
using Quant-iT™ PicoGreen® dsDNA Reagent. Sequencing was done with Illumina HiSeq2000.

Organising samples: Confirmation that all sample data has been reconciled to study groups. Metadata mapping
(samples with any specific phenotypic or medical info). Clean and organise the samples.

Data processing: Data quality check and Quality Control (QC). As arterial plaque samples represent a
host-associated metagenome, all reads were mapped against human reference genome (hg19) using bowtie
2-2.0.0. All unmapped reads (non-hg19) were extracted and aligned against non-redundant (nr) protein database
using BLASTX. All blast output files of paired read sequences were imported and analyzed using the paired-end
protocol of MEGAN. Taxonomic annotation.

All 22 arterial plaque samples together resulted in 2,610,268,774
shotgun sequencing reads. These reads were mapped against hg19
using bowtie 2 [17] with “very-sensitive” parameters to filter all
human-like sequences from our samples. The total number of
non-hg19 reads was 884,727,044 (on average 33.89% per sample,
Table 1). These non-hg19 reads were extracted and aligned against
the nonredundant (nr) protein database (version 30.07.2012) [18]
using BLASTX (ncbi-blast-2.2.25+; max e-value 10e-3)
[19]. After performing the BLASTX alignment, all output files of
paired read sequences were imported and analyzed using the
paired-end protocol of MEGAN5 [20]. Of all annotated non-hg19
reads, 2 to 16% (mean 4.6%) were assigned to bacteria in the
different samples. The rest of the reads were assigned to Eukaryota.
Table 1 provides details of sequencing read statistics and assignments of
reads after different stages of data processing. The R statistical
programming language [21] was used for multivariate statistics.
Later, in Subheading 3, we will describe a few of the analysis
approaches, revisiting this study.


³ These samples originated from the tissue bank at the Department of Forensic Medicine (Approval
No. 1501230).


In this study, our data provided evidence that suggests a wide
range of microbial agents (some of them pathogens) in atherosclerotic
plaques, and these microbes displayed differences between
symptomatic and asymptomatic plaques, as judged from the taxonomic
profiles in these two groups of patients. Further, fluorescence in
situ hybridization (FISH) was performed to validate the presence of
biofilm-like structures of a few pathogens (which had previously
been predicted from taxonomic analyses) in the symptomatic
atherosclerotic plaque samples. FISH staining demonstrates the
presence of live bacteria; thus, this is a very good approach for
cross-validation of any computational finding in the lab.

There is also potential for using these data not only for
taxonomic annotation but also to reveal the functional profiles through
partial assembly of specific members and their functional
annotations. Functional annotations reveal significant differences in
basic metabolic and disease pathway signatures between these groups.
Here, we will not provide details of the whole study, but interested
readers may refer to [15].

On this occasion, it is necessary to mention that in any similar
future project we would use DIAMOND [22] for the alignment step,
as it uses improved algorithms and additional heuristics and works
much faster than other available aligners. Scenario 1 is an example
of analyzing shotgun sequence datasets obtained from tissue samples
or host-associated metagenomes. Readers who have shotgun sequence
datasets from environmental samples or from fecal samples do not
need to perform the alignment step to get rid of host-associated
sequences, unless there is any doubt of contamination. Normally, we
suggest having control or blank samples in two wells per 96-well
plate to address any issue with contamination.


2.2 Scenario 2: The Effect of Omega-3 Polyunsaturated Fatty Acid Supplements on the Human Intestinal Microbiota


2.2.1 Study Design


A randomized, open-label, crossover trial of 8 weeks’ treatment
with 4 g mixed eicosapentaenoic acid (EPA)/docosahexaenoic acid
(DHA) in two formulations (soft-gel capsules and drinks) with a
12-week “washout” period [16] was chosen for this scenario. Healthy
volunteers of both genders aged over 50 years were included in this
study. Participants were randomized to take two types of EPA and DHA
compositions (Fig. 2):


1. Two 200 mL drinks per day (providing approximately as the 
triglyceride daily) at any suitable time of day, or 

2. Four soft-gel capsules (each containing 250 mg EPA and
250 mg DHA as the ethyl ester) twice daily with meals
(providing 2000 mg EPA and 2000 mg DHA per day).

Both interventions were taken for 8 weeks.

Fig. 2 Schedule of visits for the study to understand the effect of omega-3 polyunsaturated fatty acid
supplements on the human intestinal microbiota. Timeline labels: Visit 1 (0 weeks), Visit 2 (8 weeks),
Visit 3 (20 weeks), Visit 4 (28 weeks), Visit 5 (40 weeks); panel label “Intervention B”


After a 12-week “washout” period, participants took the sec- 
ond intervention for 8 weeks. We also included a final study visit 
after a second 12-week “washout” period (V5; Fig. 2). Fecal sam- 
ples were collected at five time-points for microbiome analysis by 
16S rRNA PCR and Illumina MiSeq sequencing. Parallel red blood 
cell (RBC) fatty acid analysis was performed by liquid chromato- 
graphy-tandem mass spectrometry. 
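
The visit schedule and the capsule dose follow from simple arithmetic, which the short sketch below reproduces. The function names are hypothetical; the numbers (8-week interventions, 12-week washouts, four 250 mg capsules twice daily) are taken from the study design described above.

```python
def visit_weeks(n_interventions=2, intervention_weeks=8, washout_weeks=12):
    """Week of each study visit: one visit after every intervention and every washout."""
    weeks = [0]  # Visit 1 at baseline
    for _ in range(n_interventions):
        weeks.append(weeks[-1] + intervention_weeks)  # end of intervention
        weeks.append(weeks[-1] + washout_weeks)       # end of washout
    return weeks


def daily_dose_mg(capsules_per_dose=4, doses_per_day=2, mg_per_capsule=250):
    """Daily amount of one fatty acid (EPA or DHA) from the soft-gel capsules."""
    return capsules_per_dose * doses_per_day * mg_per_capsule
```

With the defaults, `visit_weeks()` gives weeks 0, 8, 20, 28, and 40, matching the five visits in Fig. 2, and `daily_dose_mg()` gives 2000 mg for each of EPA and DHA, that is, 4 g of mixed EPA/DHA per day.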


2.2.2 Sample Preparation and Sequencing


Microbial DNA extractions were performed based on the method
of Yu and Morrison [23], with slight modifications. DNA was
extracted from approximately 250 mg feces using the QIAamp
DNA Stool Mini Kit (Qiagen, Germany) with bead beating. DNA
Library Prep Kit for Illumina, NEBNext Singleplex Oligos for
Illumina (New England Biolabs, UK), and unique
in-house-designed index primers (Integrated DNA Technologies, UK)
were used to allow for multiplexing of samples. Twelve cycles of
enrichment PCR were performed, and final libraries were cleaned
with AMPure Beads (Beckman Coulter, UK). Successful libraries
were confirmed by DNA 1000 bioanalyzer chips or DNA Analysis
screen tapes (Agilent, UK). Quantification was performed with the
Quant-iT dsDNA Assay Kit, broad range. A total of 30 ng of
each library was pooled and sequenced on an Illumina MiSeq
(2 x 250 bp) [24]. The variable region (V4) of the 16S rRNA
gene was sequenced for these samples.


2.2.3 Data Analyses


Demultiplexed FASTQ files were trimmed of adapter sequences
using cutadapt [25]. Paired reads were merged using fastq-join
[26] under default settings and then converted to FASTA format.
Consensus sequences were removed if they contained any ambiguous
base calls, two contiguous bases with a PHRED quality score
lower than 33, or a length more than 2 bp different from the
expected length of 240 bp. Further analysis was performed using
QIIME [27]. Operational taxonomic units (OTUs) were picked
using usearch [28] and aligned to the Greengenes reference
database using PyNAST [29]. Taxonomy was assigned using the RDP
2.2 classifier [30]. The resulting OTU BIOM files from the above
analyses were imported into MEGAN for detailed group-specific
analyses, annotations, and plots [31]. The R statistical programming
language [21] was used for multivariate statistics and other plots.
This dataset and method pipeline are described purely as an
example for similar analyses; thus, we will not explain the results
here, but interested readers may see [16]. Scenario 2 is a typical
example of analyzing 16S sequence data. In Subheading 3, we will
describe a few of the analysis approaches using data from this study.


2.3 Scenario 3: Comparing Effects of Two Drug Treatments for an Infectious Disease


Suppose we need to compare the treatment effects of two or more
drugs (e.g., X and Y), and we have time series data, that is,
patient samples from multiple time points of the treatment
course for each drug. These time series data can be collected either
on every day of the treatment period or at intervals. Furthermore,
for practical reasons we might not be able to obtain data on a
desired day, but only within 1 or 2 days of it. It is important to
select an error threshold and to apply it consistently throughout the
project. For example, we need a similar depth of sequencing reads
for all samples, or we need to follow the subsample comparison
detailed later, and we also need to discard samples with a very low
number of reads. Further, during alignment to the reference database
and during mapping to the taxonomy, similar scores and thresholds
should be used for all samples (please check the recommended
parameter selections on the individual websites of specific tools).
Additionally, there can be multiple fundamental factors in patient
samples, such as age, gender, and geography, that may not contribute
in a similar manner to resiliency. Figure 3 shows a schematic of the
metadata structure, which may help to understand the complexity of a
typical clinical trial.
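
One way to apply such rules consistently is to encode them once and run every sample through the same function. The sketch below bins sampling days into the time-point groups named in Fig. 3 and sets aside low-depth samples. The day windows follow the figure labels, except that the baseline window (days 0 to 1) is an assumption, and the minimum read count of 1000 is a purely hypothetical threshold.

```python
# (label, first day, last day); None means open-ended.
TIME_POINT_WINDOWS = [
    ("Baseline", 0, 1),          # assumption: Fig. 3 gives no day range for baseline
    ("Mid treatment", 2, 8),
    ("End of treatment", 9, 12),
    ("Follow-up", 20, None),     # day 20 or more
]
MIN_READS = 1000  # hypothetical threshold; choose one and keep it fixed per project


def time_point_group(day, windows=TIME_POINT_WINDOWS):
    """Return the time-point label for a sampling day, or None if it fits no window."""
    for label, first, last in windows:
        if day >= first and (last is None or day <= last):
            return label
    return None


def group_samples(samples, min_reads=MIN_READS):
    """samples: iterable of (sample_id, day, n_reads).

    Returns (groups, discarded): groups maps each label to its sample IDs;
    discarded lists samples with too few reads or outside every window.
    """
    groups, discarded = {}, []
    for sample_id, day, n_reads in samples:
        label = time_point_group(day)
        if n_reads < min_reads or label is None:
            discarded.append(sample_id)
        else:
            groups.setdefault(label, []).append(sample_id)
    return groups, discarded
```

Keeping the windows and the read threshold in one place makes it easy to report exactly which samples were excluded and why.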


Fig. 3 Schematic diagram of multiple factors in a clinical study. Factors shown: drug (Drug X or Drug Y),
geography, and time points: baseline, mid treatment (Day 2 to Day 8), end of treatment (Day 9/10 to Day
11/12), and follow-up (Day 20 or more)
2.3.1 Sample Preparation and Sequencing and Data Analyses


In a clinically relevant setting, this type of study aims to
determine which drug works better for a similar group of patients.
Patients are randomized between drug arms to control for selection
bias. Because we usually want to compare several factors in this
type of project, we need many samples to start with. Readers are
advised to seek statistical help with a power calculation to obtain
the preferred sample size. In general, as we end up with hundreds of
samples, we usually choose 16S sequencing as a cost-effective
solution. However, some projects can also use shotgun sequencing.
As in the previous examples, we assume that we have sequenced our
samples (by either 16S or shotgun sequencing) and performed the
further analysis steps described in the previous scenarios to obtain
a taxonomic profile for each patient at each time point. Besides
analyzing the time series of each individual separately, we have
also grouped the samples by time point: baseline, mid-treatment, end
of treatment, and follow-up. Besides treatment groups, patients are
also compared based on multiple factors such as age, gender, and
geography.
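
To give a feel for what a power calculation involves, the following Python sketch uses the textbook normal approximation for comparing the means of a single outcome (for example, a diversity index) between two arms. It is an assumption-laden simplification: real microbiome studies are multivariate and usually need more specialized methods, and the function name and default values are chosen only for illustration.

```python
import math
from statistics import NormalDist


def samples_per_arm(effect_size, alpha=0.05, power=0.80):
    """Patients per arm for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    where d is the standardized difference (mean difference / SD).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)
```

With a standardized effect size of 0.5, a two-sided significance level of 0.05, and 80% power, this approximation gives 63 patients per arm; halving the effect size roughly quadruples that number, which is why studies with several factors need many samples.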


3 General Methods for Annotation and Statistical Analyses


Broadening our focus beyond these studies, we explain below
additional analysis techniques that are used in these studies and
that can also be used in similar projects.


3.1 Taxonomic and Functional Annotation


Taxonomic annotation addresses the question, ‘Who is out there?’, or
in other words tries to obtain information regarding the species
composition of a given metagenome. Functional annotation, on the
other hand, attempts to answer the question, ‘What are they doing?’
There are different approaches to metagenome analysis. One type of
approach uses phylogenetic markers to distinguish between different
species in a sample; the most widely used marker is the small
subunit ribosomal ribonucleic acid (SSU rRNA) gene (16S or 18S). A
second type of method is based on analyzing the nucleotide
composition of reads. In a supervised approach, the nucleotide
composition of a collection of reference genomes is used to train a
classifier, which is then used to place a given set of reads into
taxonomic bins. In an unsupervised approach, reads are clustered by
composition similarity and the resulting clusters are then analyzed
in an attempt to place the reads. Subheading 4 of this chapter lists
multiple approaches and available tools, which readers can use
according to their preferences.
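
The supervised, composition-based idea can be illustrated with a deliberately small Python sketch: reference sequences are turned into k-mer frequency profiles, one average profile (centroid) is stored per taxonomic bin, and a read is placed in the bin with the nearest centroid. The word length of 2, the nearest-centroid rule, and all names are assumptions made for this example; published tools use longer words and more elaborate classifiers.

```python
import math
from collections import Counter
from itertools import product

K = 2  # toy word length; tetranucleotides (K = 4) are a common choice
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_profile(seq):
    """Relative frequency of every k-mer over the alphabet ACGT."""
    seq = seq.upper()
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    total = sum(counts[kmer] for kmer in KMERS)  # words containing N etc. are ignored
    return [counts[kmer] / total if total else 0.0 for kmer in KMERS]


def train(references):
    """Average the profiles of the reference sequences of each bin."""
    centroids = {}
    for label, seqs in references.items():
        profiles = [kmer_profile(s) for s in seqs]
        centroids[label] = [sum(col) / len(profiles) for col in zip(*profiles)]
    return centroids


def classify(read, centroids):
    """Place a read in the bin whose centroid is closest (Euclidean distance)."""
    profile = kmer_profile(read)
    return min(centroids, key=lambda label: math.dist(profile, centroids[label]))
```

An unsupervised variant would compute the same profiles but cluster them without reference labels.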

In general, we use QIIME [27] for annotating 16S rRNA sequences and
MEGAN [31] for shotgun sequencing data; MEGAN can also be used for
16S data. MEGAN is a highly efficient program for interactive
analysis and comparison of microbiome data, allowing one to explore
hundreds of samples and billions of reads. While taxonomic profiling
is performed based on the NCBI taxonomy, MEGAN also provides a
number of different functional profiling approaches. MEGAN Community
Edition also supports the use of metadata in the context of
principal coordinate analysis and clustering analysis [31]. In all
three scenarios explained in this chapter, MEGAN is used as the
primary tool for annotation. For more details on MEGAN, see
Chapter 23.

With shotgun sequencing data we have good options for functional
annotation, whereas with 16S sequences we can only perform taxonomic
analyses with confidence, although a few tools attempt to predict
metagenome functional content from marker genes [32, 33]. Most
shotgun annotation pipelines (such as MEGAN [31], MG-RAST [34],
IMG/MER [35], and EBI Metagenomics [36]) support functional
annotation, and they often use databases such as KEGG [37], SEED
[38], eggNOG [39], and COG/KOG [40], as well as protein domain
databases such as TIGRFAM [41] and PFAM [42].


3.2 Metagenome Assembly


Metagenome assembly is similar in nature to genomic assembly, which
is the reconstruction of genomes from sequenced DNA segments (or
reads), but it is more complex. The main goal is to stitch together
the reads that could be from the same genome. Here the reads are a
mixture of DNA from different organisms, which may also have widely
different levels of abundance. A few recent reviews have discussed
the new challenges and opportunities and have assessed the most
common freely available metagenome assembly tools with respect to
their output statistics, their sensitivity for low-abundance
community members, the variability in the resulting community
profiles, and their ease of use. Interested readers may refer to
these reviews [43, 44].


3.3 Rarefaction Curves


Rarefaction is a powerful method for comparing species richness
among habitats on an equal-effort basis, based on the construction
of so-called rarefaction curves [45]. It is a very useful tool for
statistical data analysis that helps us to correct for bias in
species number due to unequal sample sizes, by standardizing to the
number of species expected in a sample if it had the same total size
as the smallest sample. As an example, suppose we have two sample
groups, the first with 50 individuals and the second with 30
individuals, each with a number of species obtained from taxonomic
analysis. Rarefaction lets us compare the two groups as if both
contained the same number of individuals. Rarefaction curves are
used differently for 16S and shotgun metagenomics. Ni and colleagues
have described methods for estimating a reasonable and practical
amount of SSU rRNA gene sequencing and explained how much
metagenomic sequencing is enough to achieve a given goal [46]. In
metagenomic shotgun sequencing, the fraction of the metagenome
represented in the data set is termed coverage, which can be
assessed through rarefaction curves. Interested readers may refer to
a recent publication that advocates estimating the average coverage
obtained in metagenomic studies and briefly presents the advantages
of different approaches [47].

Fig. 4 Rarefaction. Rarefaction plot using the annotated species profile for all 22 (unstable and stable) atherosclerotic plaque samples. These curves show the number of nodes that would be present if based on 10%, 20%, and up to 90% of the reads

In Scenario 1, for comparing case and control groups of human
atherosclerotic plaque samples, we computed rarefaction curves from
the normalized profiles of 22 samples using the bacterial reads,
showing the number of nodes that would be present in the analysis if
it were based on 10% to 90% of the reads (Fig. 4). From the sequence
statistics (Table 1) and the rarefaction curves (Fig. 4), it is
apparent that 2 of the 22 samples (samples 233 and 238) had a much
higher sequencing depth than the other samples. Later in the study
we therefore omitted these two samples from the merged
case vs. control analyses.
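
The standardization behind rarefaction can be written down directly. The Python sketch below uses the classical hypergeometric formula for the expected number of taxa in a random subsample, and evaluates it at 10% to 90% of the reads in the spirit of Fig. 4. It is a minimal illustration, not the implementation used by MEGAN or QIIME; the function names are assumptions, and the exact binomial coefficients make it suitable for small toy counts rather than millions of reads.

```python
from math import comb


def expected_richness(counts, depth):
    """Expected number of taxa seen in a random subsample of `depth` reads.

    A taxon with c reads is missed with probability
    C(total - c, depth) / C(total, depth) (sampling without replacement).
    """
    total = sum(counts)
    if not 0 <= depth <= total:
        raise ValueError("depth must lie between 0 and the sample size")
    return sum(1 - comb(total - c, depth) / comb(total, depth)
               for c in counts if c > 0)


def rarefaction_curve(counts):
    """Expected richness at 10%, 20%, ..., 90% of the reads."""
    total = sum(counts)
    return [(pct, expected_richness(counts, total * pct // 100))
            for pct in range(10, 100, 10)]
```

For the two-group example in the text, one would call `expected_richness` on the 50-individual sample with `depth=30`, so that its richness is compared with that of the 30-individual sample at equal effort.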
Similarly, in Scenario 2, rarefaction was performed at various
levels to compare diversity for different sample groupings. All
groups were rarefied to the lowest read number, and diversity was
calculated using weighted and unweighted UniFrac as well as the
non-phylogenetic Bray-Curtis dissimilarity measure.


3.4 Subsample Comparison


In situations like that in Fig. 4, where two samples have a much
higher sequencing depth than the rest, another option is subsample
comparison. Instead of excluding the high-depth samples from further
study, we repeatedly draw random subsamples from each of them, each
subsample having the size of the smallest of the other samples in
the study, and then take the median of the subsamples to generate a
pseudo profile, which can serve as a comparable sample for the
group. For example, suppose that most of the samples in a study have
200,000-300,000 sequence reads, while only a few samples have
approx. 1 million reads. In that case we draw subsamples of 200,000
reads from each of the large samples a large number of times (say
1000) and take the median of the resulting profiles, which we can
then compare with the other samples.


3.5 Comparative Visualization


Comparative visualization includes different types of plots and
charts (pie charts, histograms, and many other kinds of plots),
which can help us to draw basic conclusions regarding our data.
For example, Fig. 5 depicts a basic comparison of patients in the
two drug treatment groups at the time points baseline,
mid-treatment, end of treatment, and follow-up (from Scenario 3).

[Figure 5: two rows of pie charts, titled "Genus level comparison at multiple treatment time points for Drug X" and "Genus level comparison at multiple treatment time points for Drug Y", each with the panels Baseline, Mid treatment, End of treatment, and Follow up]

Fig. 5 Genus level taxonomic comparison of patients’ microbiome (median of each time point group) in two drug treatment groups at the time points baseline, mid-treatment, end of treatment, and follow-up. Different colors indicate different genera, and the size of each colored slice of a pie reflects the percentage of that genus in the median microbiome for each time point group and each drug
From this figure we can easily see that the microbiome pattern with
drug X is more consistent (more stable over time) across the
treatment period than that with drug Y. We do not draw any
conclusion from this visual comparison alone, but with these types
of plots we can start to see whether there is any trend in our data,
which can later be investigated with appropriate statistical tests.

Further, as metagenomic data are often hierarchical in nature,
besides basic plots, which can be made only at a particular
taxonomic level (e.g., family or genus), it is often helpful to
display the whole data set as a comparative tree view. For example,
in Scenario 1, where samples from cases and from controls grouped
closely (as can be seen later in Subheading 3.9), we can explore
their broad differences by comparing the total biome of cases and
controls using a comparative tree view (Fig. 6). This kind of tree
view also helps us to assess samples from multiple time points of a
single patient, or to compare grouped data for multiple factors
(e.g., in Scenario 3).

Fig. 6 Tree view at “family” level taxonomy comparing merged data from cases and control samples using data from Scenario 1


3.6 Diversity Analyses


Diversity analysis is one of the prominent statistical approaches
that address some of the downstream analysis steps associated with
metagenomic studies. Species abundance estimates in the community
are used to make inferences about the diversity of the whole
community. The terms alpha, beta, and gamma diversity were all
introduced by R. H. Whittaker to describe the spatial component of
biodiversity [48]. Alpha diversity is the diversity of each site
(the samples in each group). Beta diversity represents the
differences in species composition among sites. Gamma diversity is
the diversity of the entire landscape of different sites (the pool
of all species from multiple samples). A diversity index measures
how many different types (such as species) there are in a dataset (a
community) and simultaneously takes into account how evenly the
basic entities (such as individuals) are distributed among these
types. Three commonly used measures of diversity, Simpson’s index,
Shannon’s entropy, and the total number of species, are related to
Rényi’s definition of a generalized entropy and are well explained
and compared by Hill [49]. Interested readers may also refer to [50]
for consistent terminology for quantifying species diversity. Many
other publications also explain this topic very well.


3.7 Comparison Using Distance Matrices


Another common technique for comparing metagenomic datasets is to
use distance matrices. First, a taxonomic profile is computed for
each data set. Second, a matrix of pairwise distances is determined
using one of several possible ecological indices. Finally, the
distances are represented using an appropriate visualization
technique. Mitra et al. [51] explained multiple distance measures
(such as Bray-Curtis, Kulczynski, χ², Hellinger, and Goodall) in the
context of multiple metagenome comparison. In addition to these,
UniFrac is another distance metric used for comparing biological
communities. It differs from dissimilarity measures such as
Bray-Curtis in that it incorporates information on the relatedness
of community members by including the phylogenetic distances between
observed organisms in the computation [52-54]. Both weighted
(quantitative) and unweighted (qualitative) variants of UniFrac are
often used in microbial ecology; the former accounts for the
abundance of observed organisms, while the latter considers only
their presence or absence.


3.8 Boxplots


In descriptive statistics, the “boxplot”, also called the “box and
whisker plot,” is one of the most informative tools for graphically
depicting groups of numerical data through their quartiles [55]. The
boxplot is a quick way of examining multiple groups of data
graphically; it readily provides information regarding quartiles,
range, variation, and even outliers, and enables us to compare
samples within and between groups. For example, Fig. 7 shows the
distribution of samples at multiple time points for both drugs
(example data from Scenario 3). From this plot we can see that
diversity with drug X is consistently higher than that with drug Y.
Further, in Fig. 5 we have already seen that the microbiome pattern
with drug X showed less disruption; thus, from these two figures we
can hypothesize that drug Y is more disruptive to the microbiome.
Such hypotheses can guide further statistical analyses.

Fig. 7 Boxplot showing Simpson diversity indices for samples from each time point and for both the drugs X and Y


3.9 Hierarchical Clustering


Cluster analysis, especially hierarchical clustering [56, 57], is an
important tool for the exploratory and unsupervised analysis of
high-dimensional datasets (unsupervised meaning that no training
dataset is needed), and it is often used in genomics and other
fields for its ability to simultaneously uncover multiple layers of
clustering structure. In our example, Fig. 8 depicts a hierarchical
clustering result of family level taxonomic comparison data for all
22 samples. Interestingly, samples 238 and P0613 differed most from
the rest; among the other samples, all unstable plaques clustered
together, apart from the stable plaque controls, which clustered
separately.
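
As a rough sketch of what average-linkage hierarchical clustering does, the following Python code repeatedly merges the two clusters with the smallest mean pairwise distance, using 1 minus the Pearson correlation between sample profiles as the distance. The choice of Pearson correlation for samples, the naive search over all cluster pairs, and the function names are assumptions for illustration; they are not the settings or software used for Fig. 8.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(names, profiles):
    """Return the merges as (members_a, members_b, height), in merge order."""
    dist = {frozenset((i, j)): 1 - pearson(profiles[i], profiles[j])
            for i, j in combinations(range(len(names)), 2)}

    def mean_distance(pair):
        a, b = pair
        return sum(dist[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    clusters = [[i] for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters with the smallest average pairwise distance.
        a, b = min(combinations(clusters, 2), key=mean_distance)
        merges.append((sorted(names[i] for i in a),
                       sorted(names[i] for i in b),
                       mean_distance((a, b))))
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return merges
```

The list of merges, read from first to last, corresponds to a dendrogram read from the leaves to the root; samples that join the tree last and at a large height are candidates for outliers.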

Interestingly, the asymptomatic atherosclerotic plaques have a
higher abundance of host microbiome-associated microbial families
such as Porphyromonadaceae, Bacteroidaceae, Micrococcaceae, and
Streptococcaceae than the symptomatic atherosclerotic plaques. In
contrast, the symptomatic atherosclerotic plaques have a higher
abundance of pathogenic microbial families such as
Helicobacteraceae and Neisseriaceae, and of sulfur-consuming
families such as sulfur-oxidizing symbionts and Thiotrichaceae, than
the asymptomatic atherosclerotic plaques (Fig. 8). For P0613, the
species profile appeared very different from all other samples.
Thus, this sample was also treated as an outlier in further analyses
(see [15] for the actual study).

Fig. 8 Taxonomic comparison of all DNA samples. Hierarchical clustering result of “family” level taxonomic comparisons of data from Scenario 1: unstable atherosclerotic plaques from 15 patients with symptomatic atherosclerotic disease (unstable plaques) and stable plaques from a control group of seven patients who died from causes other than atherosclerosis (controls). Red indicates downregulation, green indicates upregulation, and black indicates no change in read abundance level compared with all samples. Hierarchical clustering was computed with average linkage; Pearson correlation was used for clustering the families (rows) and Spearman correlation for clustering the datasets (columns)


3.10 Principal Component Analysis (PCA) and Principal Coordinates Analysis (PCoA)


PCA and PCoA are tools for multivariate analysis. PCA uses an
orthogonal transformation to convert a set of observations of
possibly correlated variables into a set of values of linearly
uncorrelated variables called principal components [58]. It is often
used for quantitative variables, so the axes in the plot have a
quantitative weight, and the positions of the samples are related to
those weights. PCoA, or multidimensional scaling (MDS), on the other
hand, is a means of visualizing the level of similarity of
individual cases of a dataset [59]. PCoA is similar to polar
ordination (PO; [60]), which arranges samples between endpoints or
‘poles’ according to the distance matrix; PCoA maximizes the linear
correlation between the distances in the ordination and the
distances in the distance matrix. Readers further interested in
these methods may see [61].

For multiple sample comparison we often use PCoA and PCA, which are
among the best tools available for multivariate analysis. They can
give us powerful information about the similarities and
dissimilarities among samples. When coupled with phenotypic data or
metadata (using colors, symbols, etc.), they can be very helpful for
understanding within-group variation. As an example, we have used
PCoA on the 22 plaque samples from Scenario 1 (Fig. 9). Here we can
see that samples 233 and 238 are very different from the others,
possibly due to their high sequencing depth (as also seen in
Fig. 4).

Biplots: In addition to the samples in a PCA or PCoA, variables can
also be plotted on the same diagram (this is called a biplot). The
biplot provides a useful tool for data analysis and allows the
visual appraisal of the structure of large data matrices [62]. In
our examples, where taxa are the variables, a biplot can show, as
arrows, the taxa that are important in determining the relatedness
of samples. For example, in Scenario 2, β diversity was compared
using principal coordinate analysis (PCoA) on all samples from all
visits, with the biplot displayed as green arrows (Fig. 10). From
this PCoA with biplot, we interpret that the samples from volunteers
8, 13, and 16 differ from those of the other volunteers and that
they have a higher abundance of Succinivibrionaceae,
Gammaproteobacteria, Aeromonadales, etc.


3.11 Canonical-Correlation Analysis (CCA) and Canonical-Correspondence Analysis (CCA)


CCA (correlation) seeks the linear combinations of the $X_i$ and of
the $Y_j$ that have the greatest correlation with each other, where
$X = (X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_m)$ are vectors of
random variables; thus it is often used as a dimension-reduction
method. The method was first introduced by Harold Hotelling [63].
CCA (correspondence), on the other hand, is a multivariate method to
elucidate the relationships between biological assemblages of
species and their environment. This method by Cajo J. F. ter Braak
involves a canonical correlation analysis and a direct gradient
analysis [64]. By environment we mean any kind of metadata, such as
physicochemical parameters obtained from the same group from which
the species data were obtained. The idea is to relate the prevalence
of a set of species to a collection of environmental variables.
Biplots are often used in CCA (correspondence) for visualization
purposes. For example, for our Scenario 2, a typical illustration of
correlation and correspondence analyses between the microbiome and
RBC fatty acid data is displayed in Fig. 11.

Fig. 9 Principal coordinate analysis (PCoA) of “family” level taxonomic comparisons of data from Scenario 1: unstable atherosclerotic plaques from 15 patients with symptomatic atherosclerotic disease (cases: cyan) and stable plaques from a control group of seven patients who died from causes other than atherosclerosis (controls: magenta)

On this occasion it is important to note that CCA does not perform
variable selection. Further, when the number of variables exceeds
the number of observations (the sample size), CCA cannot be applied
directly because the covariance matrix is singular. In a recent
study [65], the authors discussed this problem and a few existing
solutions. Additionally, they developed a method for
structure-constrained sparse canonical correlation analysis (ssCCA)
in a high-dimensional setting. ssCCA takes into account the
phylogenetic relationships among bacteria, which provide important
prior knowledge on evolutionary relationships among bacterial taxa
(see [65] for details).

Fig. 10 Principal coordinate analysis (PCoA) of level taxonomic comparisons of data from Scenario 2: all samples (V1-V5) for all participants, with the biplot displayed as green arrows. Each visit is denoted by a different color


3.12 Multivariate Analyses


Multivariate data analysis refers to any statistical approach used to 
analyze data with more than one variable. For example, as described 
in Scenario 3 we have multiple factors. The key to identifying 
important microbial taxa associated with two treatments is that 
the large datasets from each patient are compared within groups, 
and then the metadata from the patients’ groups are compared 
against each other. Analysis of multivariate data in response to 
factors, groups, or treatments in an experimental design needs 
sophisticated methods. 

Fig. 11 (a) Pearson correlation between genus level microbiome and RBC fatty acid data. (b) Canonical correspondence analysis of microbiome (genus level taxonomy) distribution in relation to blood parameters (biplot: represented by blue arrows). Red crosses represent taxa and black circles represent individual samples

To achieve this, we can use PERMANOVA (permutational multivariate
analysis of variance) [66] to test for differences between groups on
the basis of any resemblance measure. Because the test is also
sensitive to differences in within-group spread, the homogeneity of
multivariate dispersions within groups should be checked alongside
it. PERMANOVA is a better approach than ANOVA (analysis of variance)
or MANOVA (multivariate analysis of variance) for our study, as
PERMANOVA works with any distance measure that is appropriate to the
data and uses permutations to make it distribution free, rather than
assuming normal distributions. Finally, in addition to the above
multiple comparisons, we can examine whether the microbiota changes
and patterns are consistent across the geographical locales of the
treatment subjects, as our samples are from different countries. We
do not show the details of the multivariate analyses here, but there
are multiple packages available for such analyses, with good
tutorials. Interested readers may consult the packages and
websites detailed below.
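
To make the permutation idea concrete, here is a minimal one-factor PERMANOVA sketch in plain Python. It computes Bray-Curtis dissimilarities between taxonomic profiles, forms the pseudo-F statistic from sums of squared distances, and obtains a p-value by shuffling the group labels. The function names, the default of 999 permutations, and the fixed random seed are assumptions made for this example; for real analyses an established implementation (for instance in vegan or Primer-E) would normally be preferred.

```python
import random
from itertools import combinations


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))


def pseudo_f(dist, labels):
    """Pseudo-F statistic computed from a distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for group in groups:
        members = [i for i in range(n) if labels[i] == group]
        ss_within += sum(dist[i][j] ** 2
                         for i, j in combinations(members, 2)) / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permanova(profiles, labels, permutations=999, seed=0):
    """Return (pseudo-F, permutation p-value) for one grouping factor."""
    n = len(profiles)
    dist = [[bray_curtis(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
    observed = pseudo_f(dist, labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        # Count permutations that separate the groups at least as well.
        if pseudo_f(dist, shuffled) >= observed * (1 - 1e-9):
            hits += 1
    return observed, (hits + 1) / (permutations + 1)
```

Because only the distance matrix enters the calculation, any other resemblance measure could be substituted for `bray_curtis` without changing the rest of the sketch.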

The Primer-E package [67] is commonly used by microbial
ecologists and allows for multiple multivariate statistical analyses.
We often use the R statistical programming language [21] for
multivariate statistics. Moreover, R is used for several types of
graphical representation. Particular packages and projects within
the R environment provide built-in functions and libraries
especially for metagenomic datasets, such as Bioconductor [68],
vegan [69], and phyloseq [70].


4 Tools and Packages Commonly Used in Metagenomic Studies 


A list of tools is provided below for analyzing metagenomic data,
from raw sequence reads to final comparisons and statistical
analyses. A discussion of all these tools is beyond the scope of
this chapter, but interested readers can see recent review articles
[71-74]. It must be noted that there are other tools outside this
list as well.
1. Processing of raw sequence reads and quality control (QC):
   (a) FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
   (b) Fastx_toolkit (http://hannonlab.cshl.edu/fastx_toolkit/).
   (c) Cut-adapt (both adapter trimming and quality trim) [25].
   (d) BBTools (http://jgi.doe.gov/data-and-tools/bbtools/).
   (e) Condetri (Read trimmer for Illumina data) [75].
   (f) Trimmomatic (allows multiple threads) [76].
   (g) SolexaQA [77].
   (h) PRINSEQ [78].

2. Alignment tool:
   (a) BLAST [18].
   (b) USEARCH [28].
   (c) DIAMOND [22].
   (d) Rapsearch [79].
   (e) PyNAST [29].

3. Analyses for 16S projects: OTU clustering, picking, and taxonomic assignment.
   (a) QIIME [27].
   (b) USEARCH [28].
   (c) RDP classifier [30].
   (d) SILVA (for 16S + 18S) [80].
   (e) Mothur [81].
   (f) SILVAngs (https://www.arb-silva.de/documentation/silvangs/).
   (g) MEGAN [31].
   (h) AmpliconNoise [82].
   (i) Open reading frame (ORF) prediction, for example, with MG-DOTUR [83].

4. Assembly of shotgun metagenomics data.
   (a) Reference-based assembly.
       • MIRA 4 [84].
       • MetaAMOS (https://www.cbcb.umd.edu/software/metamos).
   (b) De novo assembly.
       • Newbler (Roche).
       • iAssembler [85].
       • EULER [86].
       • Velvet [87].
       • SOAP [88].
       • Abyss [89].
   (c) The next generation of assembly tools.
       • MetaVelvet-SL [90].
       • Meta-IDBA [91].
       • InteMAP [92].
       • SAT-Assembler [93].
       • IDBA-UD [94].

5. Removing near-exact matches by mapping to specific genomes.
   (a) Bowtie 2 [17].

6. Binning tools for metagenomes.
   (a) Composition-based binning algorithms.
       • S-GSOM [95].
       • PhylopythiaS [96].
       • TACAO [97].
       • PCAHIER [98].
       • ESOM [95].
       • ClaMS [99].
   (b) Similarity-based binning software.
       • MEGAN [31].
       • IMG/MER 4 [35].
       • MG-RAST [34].
       • CARMA [100].
       • MetaPhyler [101].
   (c) Unsupervised binning.
       • PhylopythiaS+ [102].
       • PhymmBL [103].
       • ESOMs [104].
       • VizBin [105].
       • IFCM (fuzzy c-means method) [106].

7. Binning of metagenome contigs for reconstructing single genomes.
   (a) ICoVeR [107].
   (b) MyCC [108].
   (c) MetaBAT [109].
   (d) GroopM [110].
   (e) MaxBin2 [111].
   (f) CONCOCT [112].

8. Identification of genes within the reads/assembled contigs or “gene calling”.
   (a) MetaGeneMark [113].
   (b) Prodigal [114].
   (c) Orphelia [115].
   (d) FragGeneScan [116].

9. Prediction of clustered regularly interspaced short palindromic repeats (CRISPRs).
   (a) CRT [117].
   (b) PILER-CR [118].
   (c) IMG/MER [35].

10. Annotation pipelines.
   (a) MEGAN [31].
   (b) QIIME for 16S projects [27].
   (c) Galaxy platform.
   (d) MG-RAST [34].
   (e) IMG/MER [35].
   (f) Primer-E package [67].
   (g) Several packages built within R [21].
       • Vegan [69].
       • Phyloseq [70].
       • Bioconductor [68].

11. Prediction of functional content from metagenomics.
   (a) PICRUSt [33].
   (b) Tax4Fun [32].

12. Statistical computing.
   (a) R [21].
   (b) Many other tools can be used for statistical analyses.

13. Web service for the analysis of metagenomic data.
   (a) The EBI Metagenomics service [36].
   (b) European Nucleotide Archive (ENA).
   (c) MG-RAST [34].
   (d) METAGENassist [119].
   (e) BusyBee Web [120].
   (f) Meta4 [121].
5 Concluding Remarks


This chapter has illustrated multiple data analysis and annotation
techniques in metagenomic studies with three case studies. It is
not a chapter about new method development but a description of
optimized pipelines using various available tools. With these
example scenarios, the use of multiple pipelines has been
demonstrated to analyze and interpret the data, starting from the
raw sequences and ending with the final statistical outputs. The
example scenarios describe some of the tools that we have used for
analyzing the projects selected for demonstration, but besides these
there are plenty of other tools available for metagenomics, most of
which are listed in Subheading 4. This chapter does not provide the
details of the tools or describe their pros and cons, but it can be
a good starting point for readers to explore the available options
for analyzing and interpreting their datasets. From this chapter
readers should get an idea of current research projects in medical
studies and of the multiple approaches used to analyze the data
originating from these projects, although readers should keep in
mind that this is not an exhaustive list of possible pipelines for
analyzing metagenomic samples. There might be other approaches as
well. While step-by-step instructions for all the tools are beyond
the scope of this chapter, the methods outlined here might help
researchers to plan, analyze, and interpret their research projects
successfully.


References
Abstract


Systems genetics combines high-throughput genomic data with genetic analysis. In this chapter, we review 
and discuss application of systems genetics in the context of evolutionary studies, in which high-throughput 
molecular technologies are being combined with quantitative trait locus (QTL) analysis in segregating 
populations. 

The recent explosion of high-throughput data—measuring thousands of RNAs, proteins, and metabo- 
lites, using deep sequencing, mass spectrometry, chromatin, methyl-DNA immunoprecipitation, etc.— 
allows the dissection of causes of genetic variation underlying quantitative phenotypes of all types. To deal 
with the sheer amount of data, powerful statistical tools are needed to analyze multidimensional relation- 
ships and to extract valuable information and new modes and mechanisms of changes both within and 
between species. In the context of evolutionary computational biology, a well-designed experiment and the 
right population can help dissect complex traits likely to be under selection using proven statistical methods 
for associating phenotypic variation with chromosomal locations. 

Recent evolutionary expression QTL (eQTL) studies focus on gene expression adaptations, mapping the 
gene expression landscape, and, tentatively, define networks of transcripts and proteins that are jointly 
modulated sets of eQTL networks. Here, we discuss the possibility of introducing an evolutionary “prior” 
in the form of gene families displaying evidence of positive selection, and using that prior in the context of 
an eQTL experiment for elucidating host-pathogen protein-protein interactions. 

Here we review one exemplar evolutionary eQTL experiment and discuss experimental design, choice of
platforms, analysis methods, scope, and interpretation of results. In brief we highlight how eQTL are 
defined; how they are used to assemble interacting and causally connected networks of RNAs, proteins, and 
metabolites; and how some QTLs can be efficiently converted to reasonably well-defined sequence variants. 


Key words Systems genetics, Genetical genomics, QTL, eQTL, xQTL, R-genes, Evolution, R/qtl, 
LMM, GEMMA, NGS, Genomics, Metabolomics, Network inference, GeneNetwork 


1 Introduction


Genetics concerns the study of heritably quantitative or complex 
traits. Many agricultural traits of interest, such as milk production 
in cattle and response to fertilizer in crops and most human, animal, 
and plant diseases, are complex traits. Associating, or linking, 
complex traits with certain positions on the genome is achieved
through the mapping of the so-called quantitative trait loci (QTL). 

Mapping QTL in experimental populations is possible when 
linkage and/or association information is available. When we have a 
population of individuals with known genotypes, it may be possible 
to link a phenotype with a certain genotype. To genotype indivi- 
duals, first marker maps are created. A marker is a known genomic 
location, where the genotype of an individual can be determined. In 
the early days, the genotype was determined by visible chromosome 
features, later with restriction fragment length polymorphism 
(RFLP) and amplified fragment length polymorphism (AFLP, see 
also [1-3]), and, increasingly, with SNP/haplotype data [4]. When
all individuals with genotype A at a marker location somewhere on
the genome are susceptible to a disease and all other individuals
with genotype B are not, there is linkage/association or a QTL. If it
is clear cut, i.e., a single QTL explains all phenotype variance, it is
likely to be a single gene effect. Often it is not clear cut, and we 
need statistics to determine the strength of association between 
phenotype and genotype. 
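
As an illustration of what "strength of association" can mean in practice, the following minimal sketch computes a LOD score for a single marker by comparing a model with one overall phenotype mean against a model with one mean per genotype class. The function name, the toy data, and the choice of a simple regression-based LOD are assumptions made for this example; they do not reproduce the method of any particular package.

```python
import math

def single_marker_lod(genotypes, phenotypes):
    """LOD score for one marker (illustrative helper, not from any package).

    Compares a null model (one overall mean) with a model that fits a
    separate mean per genotype class:
        LOD = (n / 2) * log10(RSS_null / RSS_marker)
    """
    n = len(phenotypes)
    overall_mean = sum(phenotypes) / n
    rss_null = sum((y - overall_mean) ** 2 for y in phenotypes)

    rss_marker = 0.0
    for cls in set(genotypes):
        ys = [y for g, y in zip(genotypes, phenotypes) if g == cls]
        cls_mean = sum(ys) / len(ys)
        rss_marker += sum((y - cls_mean) ** 2 for y in ys)

    if rss_marker == 0.0:
        return math.inf  # the marker explains all phenotypic variance
    return (n / 2) * math.log10(rss_null / rss_marker)

# Hypothetical toy data: six individuals genotyped at one marker.
genotypes = ["A", "A", "A", "B", "B", "B"]
linked = [10.1, 9.8, 10.3, 12.0, 12.2, 11.9]    # phenotype follows genotype
unlinked = [10.1, 12.0, 11.9, 9.8, 12.2, 10.3]  # same values, no pattern

lod_linked = single_marker_lod(genotypes, linked)
lod_unlinked = single_marker_lod(genotypes, unlinked)
print(lod_linked > lod_unlinked)
```

A high LOD score indicates that splitting individuals by genotype explains much of the phenotypic variance; in a real analysis the significance threshold would be set by permutation rather than by eye.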

It is also possible to use linkage disequilibrium (LD) to map 
QTL in outbred and natural populations. LD occurs when certain 
stretches of the genome (haplotypes) show nonrandom behavior 
based on allele frequencies and recombination. Associating haplo- 
type frequencies with phenotypes potentially renders QTL. Kim 
et al. describe the genome-wide pattern of LD in a sample of 
19 Arabidopsis thaliana accessions using SNP microarrays 
[5]. LD is tested, for example, by Dixon et al., to globally map 
the effect of polymorphism on gene expression in 400 children 
from families recruited through a proband with asthma [6]. 
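
To make the notion of nonrandom association between loci concrete, the sketch below computes two common LD statistics, D and r², for a pair of biallelic loci from a list of phased haplotypes. The two-character haplotype encoding and the function name are assumptions chosen for this example only.

```python
def linkage_disequilibrium(haplotypes):
    """Return (D, r_squared) for two biallelic loci.

    `haplotypes` is a list of two-character strings such as "AB" or "ab":
    the first character is the allele at locus 1 ("A" or "a"), the second
    the allele at locus 2 ("B" or "b"). This encoding is hypothetical.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == "A" for h in haplotypes) / n   # frequency of A
    p_b = sum(h[1] == "B" for h in haplotypes) / n   # frequency of B
    p_ab = sum(h == "AB" for h in haplotypes) / n    # frequency of AB
    d = p_ab - p_a * p_b                             # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else float("nan")
    return d, r_squared

complete_ld = ["AB"] * 5 + ["ab"] * 5          # alleles always travel together
equilibrium = ["AB", "Ab", "aB", "ab"] * 3     # alleles combine at random

print(linkage_disequilibrium(complete_ld))
print(linkage_disequilibrium(equilibrium))
```

When r² is close to 1, genotyping one locus is nearly as informative as genotyping the other, which is what makes haplotype-based association mapping possible.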

The use of terms “association” and “linkage” can be confusing, 
even in literature. In this text we use association with haplotypes in 
natural populations of unrelated individuals and linkage with mar- 
kers in families and groups of families, often termed experimental 
populations. Note some genetic studies are hybrids of both meth- 
ods, such as Dixon et al. [6], and individuals are related, i.e., some 
within-family linkage information is available for 400 children from 
206 families which should be accounted for in the analysis. 

Statistical power can be increased by using experimental crosses 
instead of natural populations. For example, each individual line in 
a set of recombinant inbred lines (RILs) is homozygous across the 
genome, doubling the genetic variance, simplifying genetic models, 
and increasing statistical power. For model organisms, such as 
A. thaliana, Caenorhabditis elegans, Drosophila melanogaster, and 
Mus musculus, genotyped and even fully sequenced experimental 
crosses are available; i.e., for these species it is not necessary to 
generate a new cross, and for these crosses comprehensive SNP and 
sequence data may be available. One of the features of inbred model 
organisms is that they are “immortal” which means that 
experiments conducted more than 10, even 30, years ago can still
be compared with those today. Databases, such as GeneNetwork 
[7, 8], contain thousands of studies conducted on the same indi- 
vidual mouse strains. 

Systems genetics combines genetics with high-throughput 
molecular technologies. Combining gene expression, as measured 
by microarray probes or RNA sequencing, with linkage leads to 
gene expression QTL (eQTL). Such eQTL studies elucidate how 
genotypic variation underlies, for example, morphological pheno- 
types, by using gene expression levels as intermediate molecular 
phenotypes. In other words, the expression level, as measured by a 
microarray probe or probe set, is treated as a phenotype, i.e., a gene 
expression trait. This phenotype is associated with the genome in 
the form of one or more eQTL. With microarrays, the genomic 
location of the probe is usually known. Therefore, expression phe- 
notype and probe connect two types of genomic information: 
eQTL location(s) and gene location. It is usually assumed that 
eQTL loci represent cis- or trans-transcription regulators of the 
target gene [9]. If the eQTL is located close to the gene on the 
genome, the eQTL may point to a cis-regulator. If the eQTL is 
located far from the gene on the genome, the eQTL may point to a 
trans-regulator of a single gene or even eQTL trans-bands that 
regulate multiple genes (see Fig. 1a and [10, 11]).
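
The distinction between cis and trans eQTL, and the idea of a trans-band, can be sketched as follows. The sketch assumes that both gene and eQTL peak positions are available as physical coordinates in base pairs; the 1 Mb window, the bin size, the minimum number of genes per band, and all names are illustrative choices, not values taken from the studies cited here.

```python
from collections import defaultdict

def classify_eqtl(gene_chrom, gene_pos, qtl_chrom, qtl_pos, cis_window=1_000_000):
    """Label an eQTL 'cis' if it lies near its target gene, else 'trans'.

    The 1 Mb window is an arbitrary illustrative cutoff.
    """
    if gene_chrom == qtl_chrom and abs(gene_pos - qtl_pos) <= cis_window:
        return "cis"
    return "trans"

def find_trans_bands(eqtl, min_genes=3, bin_size=1_000_000, cis_window=1_000_000):
    """Group trans eQTL into genomic bins and report bins hit by many genes.

    `eqtl` is a list of (gene, gene_chrom, gene_pos, qtl_chrom, qtl_pos).
    """
    genes_per_bin = defaultdict(set)
    for gene, g_chrom, g_pos, q_chrom, q_pos in eqtl:
        if classify_eqtl(g_chrom, g_pos, q_chrom, q_pos, cis_window) == "trans":
            genes_per_bin[(q_chrom, q_pos // bin_size)].add(gene)
    return {b: sorted(g) for b, g in genes_per_bin.items() if len(g) >= min_genes}

# Hypothetical example records.
example_eqtl = [
    ("geneA", "1", 5_200_000, "1", 5_000_000),   # close to the gene: cis
    ("geneB", "2", 1_000_000, "4", 7_300_000),   # different chromosome: trans
    ("geneC", "3", 9_000_000, "4", 7_600_000),
    ("geneD", "5", 2_500_000, "4", 7_900_000),
]
print(find_trans_bands(example_eqtl))
```

In practice QTL positions are often expressed in centimorgans, so a conversion to physical positions (or a comparison on the genetic map) would be needed before applying such a distance criterion.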

In a similar fashion, proteins and metabolites can be measured 
to map protein QTL (pQTL) and metabolite QTL (mQTL). A
remarkable study published in 1994 used two-dimensional protein 
electrophoresis and a restriction fragment length polymorphism 
map (RFLP) [12]. Deep sequencing, chromatin, and methyl- 
DNA immunoprecipitation are just a few of the latest technologies 
that add to the arsenal of tools available for the study of the genetic 
variation underlying quantitative phenotypes. Together, eQTL,
mQTL, and pQTL are referred to as xQTL. Different xQTL appear
to confirm each other, for example, with the A. thaliana glucosi- 
nolate pathway where eQTL, mQTL, and pQTL were mapped 
together and used to infer the underlying pathways [13]. Such 
causal inference can lead to dissecting pathways and gene networks 
which is an active field of research, e.g., [14-16] (see also Fig. 1). 


From the perspective of evolutionary biology, systems genetics has 
been applied to elucidate evolutionary adaptations of transcript 
regulation. For example, Fraser et al. introduced a test for 
lineage-specific selection on gene expression and analyzed the 
directionality of microarray eQTL for 112 haploid segregants of a 
genetic cross between two strains of the budding yeast Saccharomy- 
ces cerevisiae, reanalyzing the two-color cDNA microarray data of 
Brem and Kruglyak [17]. They found that hundreds of gene 
expression levels have been subjected to lineage-specific selection. 
Comparing these findings with independent population genetic
evidence of selective sweeps suggests that this lineage-specific selec-
tion has resulted in recent sweeps at over a hundred genes, most of
which led to increased transcript levels. Fraser et al. suggest that
adaptive evolution of gene expression is common in yeast, that
regulatory adaptation can occur at the level of entire pathways,
and that similar genome-wide scans may be possible in other spe-
cies, including human [18].

[Figure 1, panels: (a) Prior (gene clusters); (b) Network inference]
Fig. 1 In this hypothetical and schematic example related to mapped locations on a chromosome, prior
information is combined with multiple phenotype-genotype QTL mappings to zoom in on genomic areas and to
reason about causal relations between different layers of information. (a) The prior (red area on the
chromosome) points out that certain sections are of interest; these sections consist of related genes with
high homology showing evidence of positive selection, as discussed in the main text. The blue double arrow
points out the confidence interval for each QTL, above the significance threshold (red dotted line). The
accumulated evidence (light-blue areas) leads to a narrowed down section on the genome, where in this case
the prior information is the most specific. In addition, expression phenotypes A and B point to exact gene
locations (dotted line, based on exact probe information). (b) To infer causal relationships, network inference is
possible. On the left (vertical |), traits A, B, and D map to one hot spot, where A may be a regulator of B
because one QTL is shared. B causes metabolite phenotype C, again a shared QTL. Phenotype D matches A
and B, and phenotype E matches A, B, and C. These causal relationships are drawn by arrows. The figure
suggests that, even if individual QTL are not very informative, the accumulated evidence starts to paint a
picture

In another S. cerevisiae study, Zou et al., by reanalyzing the 
same two-color cDNA microarray data, uncovered genetic regu- 
latory network divergence between duplicate genes. They found 
evidence that the regulation of the ancestral gene diverged due to 
gene duplication [19]. 

Li et al. studied plasticity of gene expression in C. elegans, using
a set of 80 RILs generated from a cross of N2 (Bristol) and CB4856 
(Hawaii), representing two genetic and ecological extremes of 
C. elegans. While the overall level of polymorphism among wild 
isolates of C. elegans is relatively low, the genetic distance between 
N2 and CB4856 is high, representing millions of years of genetic 
drift. Differential expression induced in a RIL population by tem-
peratures of 16 °C and 24 °C has a strong genetic component. For
a group of trans-regulated genes, there was prominent evidence for a common
master regulator: an eQTL trans-band of 66 coregulated genes 
appeared at 24 °C. The results suggest widespread genetic variation 
of differential expression responses to environmental impacts and 
demonstrate the potential of systems genetics for mapping the 
molecular determinants of phenotypic plasticity [11], leading to a 
more generalized systems genetics, where value is added from 
environmental perturbation [20]. 

Hager et al. determined that genetic architecture supports 
mosaic brain evolution and independent brain-body size regulation 
by a quantitative genetic approach involving over 10,000 mice from
the BXD family of RILs. The BXD family consists of over 100 lines derived
from parental strains that differ at five million single nucleotide 
polymorphisms (SNPs), indels, transposons, and copy-number var- 
iants. This model system harbors naturally occurring genetic varia- 
tion at a level approximating that of human populations. The study 
utilizes a high-density linkage analysis to map loci modulating 
phenotypic variation in overall brain size, body size, and the size 
of seven major brain parts: neocortex, cerebellum, striatum, olfac- 
tory bulb, hippocampus, lateral geniculate nucleus, and basolateral 
complex of the amygdala. Under the mosaic evolutionary hypoth- 
esis, the size of different systems evolves independently due to 
differential selective pressures associated with different tasks. They 
identified independent loci for size variation in seven key parts of 
the brain and observe that brain parts show low or no phenotypic 
correlation, as is predicted by a mosaic scenario. They also demon- 
strate that variation in brain size is independently regulated from 
body size [21]. 

Kliebenstein et al. detected significant gene network variation
in 148 RILs originating from a cross between two A. thaliana
accessions, Bay-0 and Shahdara. They were able to identify eQTL
controlling network responses for 18 out of 20 a priori defined
gene networks, representing 239 genes [22].

According to Gilad, eQTL studies show that (1) variation in
gene expression levels is both widespread and highly heritable;
(2) gene expression levels are highly amenable to genetic mapping;
and (3) most strong eQTL are found near the target gene, suggest-
ing that variation in cis-regulatory elements underlies much of the
observed variation in gene expression levels [23]. Meanwhile,
Alberts et al. suggest that sequence polymorphisms influencing
the binding of microarray probes may cause many false cis eQTL,
which should be accounted for [24].


1.2 Adding a Prior


QTL mapping links complex traits with one or more locations on 
the genome (see Fig. 1). Such a location is a wide measure because a 
QTL is a statistical estimate and rarely a precise indicator. On the 
genome, a single QTL may represent tens, hundreds, and even 
thousands of real genes. Combining the QTL with high- 
throughput technologies, such as microarrays, can add informa- 
tion. To zoom in on the genes underlying QTL, information from 
other sources has to be utilized. Such a priori knowledge (prior)
could consist of results from traditional linkage studies or associa- 
tion studies of, for example, human disease. That way one can 
assign a specific regulatory role to polymorphic sites in a genomic 
region known to be associated with disease [23]. Other useful 
priors can be derived from existing information on gene ontology 
terms, metabolic pathways, and protein-protein interactions, which 
can be used to identify genes and pathways [25], provided these 
databases are sufficiently informative. 

Zou et al., for example, used gene ontology as a prior and 
concluded that trans-acting eQTL divergence between duplicate 
pairs of genes is related to a fitness defect under treatment condi- 
tions, but not with fitness under normal condition [19]. 

Chen et al. identified strong candidate genes for resistance to 
leaf rust in barley and on the general pathogen response pathway 
using a custom barley microarray on 144 doubled haploid lines of 
the St/Mx population [26]. Fifteen thousand six hundred and 
eighty-five eQTL were mapped from 9557 genes. Correlation anal- 
ysis identified 128 genes that were correlated with resistance, of 
which 89 had eQTL colocating with the phenotypic QTL (phQTL) 
or classic QTL. Transcript abundance in the parents and conserva- 
tion of synteny with rice prioritized six genes as candidates for 
Rphq11, the phQTL of largest effect [26]. 

In this chapter we discuss the steps needed to design an xQTL
experiment to make use of systems genetics in evolutionary studies 
more concrete. As the prior we add information on plant host genes 
showing evidence of positive selection. 


2 Designing an Evolutionary xQTL Experiment 


An experimental design based on systems genetics can highlight 
sections of the genome showing correlation with an evolutionary 
trait. One such evolutionary trait of interest is plant resistance 
against pathogens. Plants have developed mechanisms to defend 
themselves against pests. When a pathogen, such as potato blight 
Phytophthora infestans, or a nematode, such as Meloidogyne hapla, 
infects a plant, it uses a battery of so-called effectors to help invade 
the plant. Some of these effector molecules act to dissolve cellulose 
[27]. Intriguingly, other molecules are involved in actively repro- 
gramming plant cells. Such plant-pathogen effectors have been 
shown to mimic plant transcription factors [28] and switch on 
genes that help the pathogen [29]. A susceptible plant allows the 
pathogen to suppress defense mechanisms and to change cell con- 
figuration. For example, the nematodes M. hapla and Globodera 
rostochiensis transform plant cells, so they become elaborate feeding 
structures. The genetics of this plant-pathogen interaction is poten- 
tially even relevant for human medicine, as an increased under- 
standing of host-pathogen relationships may help understand the 
workings of the innate immune system and nematode immunomo- 
dulation [30, 31]. The innate immune system, through plant resis- 
tance genes (R-genes, see Box 1), influences susceptibility to 
infections in all multicellular organisms and is a much older evolu- 
tionary mechanism than the advanced adaptive immune system 
found in higher organisms. 


Box 1: Adaptive evolution in R-genes 

Plant resistance genes (R-genes) are a homologous family of 
genes, formed by gene duplication events and hypothesized 
to be involved in an evolutionary arms race with pathogen 
effectors. R-genes are involved in recognizing specific patho- 
gens with cognate avirulence genes and initiating defense 
signaling that results in disease resistance [32]. R-genes are 
characterized by a molecular gene-for-gene interaction [33] 
in which a specific allele of a disease resistance gene recognizes 
an avirulence protein or pathogen allele. This specificity is 
often encoded, at least in part, in a relatively fast-evolving 
leucine-rich repeat (LRR) region [34], which consists of a 
varying number of LRR modules. Activation of at least some 
of these proteins is regulated in trans, as has been shown for 
RPM1 and RPS2 [35]. 

A single A. thaliana plant has about 150 R-genes, repre- 
senting a subset of R-genes in the overall population. The 
protein products of R-genes are involved in molecular inter- 
actions. They generally have a recognition site which can dock 
against, i.e., recognize, one or more specific molecule(s). The 
proteins encoded by the largest class of R-genes carry a 
nucleotide-binding site LRR domain (NB-LRR, also referred 
to as NB-ARC-LRR and NBS-LRR). NB-LRR R-genes can 
be further subdivided based on their N-terminal structural 
features into TIR-NB-LRR, which have homology to the 
Drosophila Toll and mammalian interleukin-1 receptors and 
CC-NB-LRR, which contain a putative coiled-coil motif 
[36]. The LRR domain appears to mediate specificity in path- 
ogen recognition, while the N-terminal TIR, or coiled-coil 
motif, is likely to play a role in downstream signaling 
[34]. When a molecule is docked, the R-protein is able to 
activate pathways in the cell, resulting in, for example, a 
hypersensitive response causing apoptosis and preventing 
spread of infection. 

Meanwhile, one single R-protein only recognizes one 
type of invading molecule. Therefore, through its R-genes, 
one individual plant only recognizes a limited number of 
strains of invading pathogens, as the individual pathogens 
have variation in effectors too. When a pathogen evolves to 
use nonrecognized effectors, the plant becomes susceptible. 
The success of plant defense is determined by both evolution 
and the variation of specificity in a population. Unlike the 
evolved mammal immune system, which can change in a living 
organism and learn about invasions “on the fly” [37], plant 
R-genes depend on the variation inside a gene pool to provide 
the resistance against a pathogen; see, for example, Holub 
et al. [38]. Even so, many genes involved in pathogen recog- 
nition undergo rapid adaptive evolution [39], and studies 
have found that A. thaliana R-genes show evidence of posi-
tive selection, e.g., [40-42].


In this chapter we do not limit ourselves to (known) R-genes. 
Plants have evolved a complex array of chemical and enzymatic 
defenses, both constitutive and inducible, that are not involved in 
pathogen detection but whose effectiveness influences pathogene- 
sis and disease resistance. The genes underlying these defenses 
comprise a substantial portion of the host genome. Based on
genomic sequencing, it is estimated that some 14% of the 21,000
genes in A. thaliana are related to defense against pathogens
[43]. Most of these genes are not involved in direct pathogen
detection, but their protein products interact directly with patho-
gen proteins or protein products at the molecular level. Among
these proteins, for example, are chitinases and endoglucanases that
attack and degrade the cell walls of pathogens and which pathogens
counterattack with inhibitors. Such systems of antagonistically
interacting proteins provide the opportunity for molecular coevo-
lution of individual systems of attack and resistance [39].

In this chapter we design an experiment to look for all gene
families showing evidence of positive selection. This evidence of
positive selection is the prior for eQTL analysis: combining known
genomic locations of gene families with eQTL locations derived
from gene expression variation in a host-pathogen interaction
experiment, which hopefully results in zooming in on gene families
involved in plant resistance. The prior adds statistical power in
locating putative gene families involved in host-pathogen coevolu-
tion (Fig. 1). Note that, in this chapter, the term “interaction” is
used in two ways. The first is for QTL interaction, where two QTL
on the genome interact statistically. The second is for host-
pathogen gene-for-gene interaction, where gene products from
different species interact physically.


2.1 Create a Prior with PAML


To create the prior, we use Ziheng Yang’s codeml implementation 
of phylogenetic analysis by maximum likelihood (PAML) 
[44]. PAML can find amino acid sites which show evidence of
positive selection using dN/dS ratios, i.e., the ratio of
non-synonymous over synonymous substitution rates (ω, see [44]). The
calculation of maximum likelihood for multiple evolutionary mod- 
els is computationally expensive, and executing PAML over an 
alignment of a hundred sequences may take hours, sometimes 
days, on a PC. The software for generating the prior is prepackaged 
and makes up the workflow in Chap. 25, which includes BLAST 
[45], Clustal Omega [46], pal2nal [47], PAML [44], and 
BioRuby [48].

It is possible to find nonoverlapping large gene families by 
using BLASTCLUST, a tool that is part of the BLAST tool set 
[45]. After fetching the A. thaliana cDNA sequences from the 
Arabidopsis Information Resource (TAIR) [49], convert the 
sequences to a protein BLAST database format. Based on a homol-
ogy criterion, the identity score, genes are clustered into puta-
tive gene families by running BLASTCLUST with 70% amino acid
sequence identity. Note that the percentage identity may not render 
all families and will leave out a number of genes. It is used here for 
demonstration purposes only. For A. thaliana such a genome-wide 
search finds at least 60 gene families, including some R-gene 
families. 
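
The idea of grouping genes into putative families by a sequence identity threshold can be sketched with single-linkage clustering over precomputed pairwise identities. This is an illustration of the principle only: it is not BLASTCLUST itself, the input format and function name are assumptions, and real clustering would also consider alignment coverage.

```python
def cluster_by_identity(pairwise_identity, threshold=70.0):
    """Single-linkage clustering of genes by percent identity.

    `pairwise_identity` maps (gene_a, gene_b) to percent amino acid
    identity (hypothetical input format). Two genes end up in the same
    family if a chain of pairs with identity >= threshold connects them.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path compression.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), identity in pairwise_identity.items():
        root_a, root_b = find(a), find(b)
        if identity >= threshold and root_a != root_b:
            parent[root_a] = root_b

    families = {}
    for gene in list(parent):
        families.setdefault(find(gene), set()).add(gene)
    # Largest families first, ties broken alphabetically.
    return sorted((sorted(f) for f in families.values()),
                  key=lambda f: (-len(f), f))

# Hypothetical identities between five genes.
identities = {
    ("g1", "g2"): 85.0,
    ("g2", "g3"): 72.0,
    ("g1", "g3"): 60.0,   # below threshold, but g1 and g3 link through g2
    ("g3", "g4"): 40.0,
    ("g4", "g5"): 90.0,
}
families = cluster_by_identity(identities, threshold=70.0)
print(families)
```

Because linkage is transitive, a stricter threshold yields more and smaller families, which is why the choice of 70% leaves out some genes.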
2.2 Select a Suitable Experimental Population


After aligning all family sequences, use PAML’s codeml to find 
evidence of positive selection in the gene families. Clustal Omega is 
used to align the amino acid sequences and create a phylogenetic 
tree. Next, pal2nal creates codon alignments, which can be used by 
PAML. Finally, run PAML’s codeml M0-M3 (one ratio vs. discrete)
tests and M7-M8 (beta vs. beta + ω) tests in a computing
cluster environment as shown in Chap. 25. 

An M0-M3 χ² test finds that 43 gene families (out of 60) show
significant evidence of positive selection. M7-M8, meanwhile, finds 
35 gene families. Therefore, based on the described procedure, 
approximately half the families show significant evidence of positive 
selection and can be considered candidate gene families involved in 
host-pathogen interactions. Note that this number contains false 
positives because the evolutionary model may be too simplistic; see 
also [50]. Nevertheless, these candidate gene families can be used as 
an effective filter for further research. 
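
The χ² tests mentioned above are likelihood ratio tests between nested codeml models. The sketch below shows the arithmetic, assuming the log-likelihoods have already been read from the codeml output. The degrees of freedom used (4 for M0 vs. M3 with three site classes, 2 for M7 vs. M8) are the values conventionally used for these comparisons; the log-likelihood values are invented for illustration, and the helper names are hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square distribution with even df.

    For df = 2k the tail has the closed form
        exp(-x/2) * sum_{i<k} (x/2)^i / i!
    which avoids any dependency outside the standard library.
    """
    if df <= 0 or df % 2:
        raise ValueError("this helper only supports positive even df")
    if x <= 0:
        return 1.0
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(df // 2))

def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (test statistic, p-value) for two nested models."""
    stat = 2 * (lnl_alt - lnl_null)
    return stat, chi2_sf_even_df(max(stat, 0.0), df)

# Invented log-likelihoods for one gene family (not real results).
lnl_m7, lnl_m8 = -2345.67, -2338.12
stat, p_value = likelihood_ratio_test(lnl_m7, lnl_m8, df=2)
print(f"2*delta(lnL) = {stat:.2f}, p = {p_value:.2e}")
```

When many gene families are tested in this way, the resulting p-values should themselves be corrected for multiple testing before a family is called significant.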

When a gene family displays evidence of positive selection, the 
genome locations can be used as a prior for systems genetics (see 
Fig. 1). With the full genome sequence of A. thaliana available, the 
location of gene families showing evidence of positive selection is 
known. For example, in the Columbia (Col-0) ecotype, the major- 
ity of the 149 R-genes are combined in clusters spreading 2-9 loci; 
the remaining 40 are isolated. Clusters are organized in so-called 
superclusters [36, 51]. Phylogenetic analysis shows that such clus- 
ters are the result of both old segmental duplications and recent 
chromosome rearrangements [36, 52]. 


To select a suitable experimental population, the choice of parents 
is key. Because we want a descriptive evolutionary prior based on 
gene families with known genome locations, we also need a 
sequenced genome, from one parent and ideally from both of the 
parental strains. The choice of parents for QTL analysis is normally 
based on large (classical) phenotypic differences. For testing path- 
ogen resistance, the choice would ideally be one susceptible parent 
and one resistant (nonsusceptible) parent. For eQTL, phylogenetic 
distance can be used, when there is no obvious phenotype. In 
general, it is a good idea to choose one or both parents from 
common library strains based on, for example, Columbia (Col-0), 
Landsberg erecta (Ler-0), Wassilewskija (Ws-0), or Kashmir 
(Kas-1). This is because a great number of experimental resources 
and online information will be available. In addition, a reference 
genetic background is provided in this way, which allows the com- 
parison of the effects of QTL and mutant alleles [53]. A number of 
RIL populations can be found through TAIR, a model organism 
database providing a centralized, curated gateway to Arabidopsis 
biology, research materials, and community [49]. 


2.3 Select an xQTL Technology


2.4 Sizing the Experimental Population

A large part of published xQTL studies is based on gene expression
eQTL, partly because a gene expression probe provides a direct geno-
mic link. When it comes to selecting single-color or two-color
arrays, one consideration may be that two-color arrays have higher 
efficiency when using a distant pair design [54]. 

Deep sequencing technology (RNA-seq, [55]) is affordable for
eQTL studies. The main advantage over microarrays is improved 
signal-to-noise ratios and possibly improved coverage depending 
on the reference genome. Microarrays are noisy partly due to cross 
hybridization, e.g., [56], and have limited signal on low-abundance 
transcripts or expressors; both facts are detrimental to significance. 
Deep sequencing is no panacea, however, since it accentuates the 
high expressors. High expressors are expressed thousands of times 
higher than low expressors. Low expressors may lack significance 
for differential expression. Worse, because deep sequencing is sto-
chastic, many low expressors may even be absent. Another point to
consider is that currently at least 1 in 1000 nucleotide base pairs is
misread, which makes it harder to disentangle error from genetic
variation. Only when a sequence polymorphism is measured many
times (say 20×) can it be considered to represent genetic variation.
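
The reasoning behind requiring repeated observations can be illustrated with a simple binomial model. The sketch assumes that read errors at a given position are independent and occur at the rate of 1 in 1000 mentioned above; real error profiles are not uniform, so the numbers are indicative only, and the function name is hypothetical.

```python
import math

def prob_errors_at_least(k, depth, error_rate=0.001):
    """Probability that at least k of `depth` reads are misread at one site.

    Assumes independent errors with a uniform per-base error rate.
    """
    return sum(math.comb(depth, i)
               * error_rate ** i
               * (1 - error_rate) ** (depth - i)
               for i in range(k, depth + 1))

depth = 20
p_one = prob_errors_at_least(1, depth)    # a single deviating read
p_five = prob_errors_at_least(5, depth)   # five deviating reads
print(f"P(>=1 error in {depth} reads) = {p_one:.4f}")
print(f"P(>=5 errors in {depth} reads) = {p_five:.2e}")
```

Under these assumptions a single deviating read at 20× coverage is unremarkable, whereas several reads supporting the same variant are very unlikely to arise from sequencing error alone.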

Also a choice for a certain eQTL technology should take into 
account that, when looking at differential gene expression analysis, 
different microarray platforms agree with each other, but overlap 
between microarray and deep sequencing is much lower, suggest- 
ing a technical bias [57]. 

For an example of a metabolite mQTL study, see Keurentjes 
et al. [58] and Fu et al. [59]. For a study integrating eQTL, pQTL, 
mQTL, and classical phenotypic QTL, see Fu et al. [60] and Jansen 
et al. [13]. 


The size of the experimental population should be large enough to 
give informative results. For classical QTL analysis, the sizing may 
be assisted using estimates of total environmental variance and the 
total genetic variance derived from the accessions, selected as par- 
ents. Roughly, population sizes of 200 RILs, without replications, 
will allow detection of large-effect QTL with an explained variance 
of 10% in confidence intervals of 10-20 cM. Detection of small- 
effect QTL or mapping accuracy below 5% requires increasing the 
population size to at least 300 RILs [53]. It is important to note 
that QTL mapping accuracy is a function of marker density and 
population size. The number of strains to use differs between 
inbred lines. The promise of extremely dense marker maps, such as
delivered by SNPs, does not automatically translate to higher accu-
racy. It is the number of recombination events in the population for
a particular genomic region that limits QTL interval size. In fact,
current marker maps, in the order of thousands of (evenly spread)
markers per genome, suit population sizes of a few hundred RILs.
2.5 Analyzing the xQTL Experiment with R/qtl


It is a fallacy, for example, to expect higher mapping power when 
combining an ultradense SNP map with just 20 individuals. 

For high-throughput xQTL, the experimental population 
should be sized against an acceptable false discovery rate (FDR), 
minimizing for type I and type II errors. This can be achieved using 
a permutation strategy to assess statistical significance, maintaining 
the correlation of the expression traits while destroying any genetic 
linkages or associations in natural populations: marker data is per- 
muted while keeping the correlation structure in the trait data, such 
as presented by Breitling et al. [61]. Unfortunately, this informa- 
tion differs for every experiment and is only available afterward. 
Analyzing a similar experiment, using the same tissue and data 
acquisition technology, may give an indication [60], but when no 
such material is available, a crude estimate may be had by taking the 
thresholds of a (classic) single-trait QTL experiment and adjusting 
that for multiple testing by the Bonferroni correction (minimize 
type I errors) or Benjamini-Hochberg correction (minimize type II
errors). Note that Bonferroni results in a very conservative 
estimate. 
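
The difference between the two corrections can be seen in a small sketch. It applies a Bonferroni threshold and the Benjamini-Hochberg step-up procedure to the same list of p-values; the p-values are made up for illustration, and the function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests

def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of tests declared significant by the BH step-up procedure.

    Finds the largest rank k with p_(k) <= k * fdr / m and rejects the
    k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# Made-up p-values for ten expression traits.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

threshold = bonferroni_threshold(0.05, len(p_values))
bonferroni_hits = [i for i, p in enumerate(p_values) if p <= threshold]
bh_hits = benjamini_hochberg(p_values, fdr=0.05)
print("Bonferroni:", bonferroni_hits)
print("Benjamini-Hochberg:", bh_hits)
```

In this toy example Bonferroni retains fewer traits than Benjamini-Hochberg, which reflects its more conservative control of false positives. Neither correction accounts for the correlation between expression traits, which is what the permutation strategy described above is designed to preserve.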


R/qtl is extensible, interactive free software for the mapping of 
xQTL in experimental crosses. It is implemented as an add-on 
package for the widely used statistical language/software R. Since 
its introduction, R/qtl has become a reference implementation 
with an extensive guide on QTL mapping [62]. 

R/qtl includes multiple QTL mapping (MQM), as described in 
[10], an automated procedure, which combines the strengths of 
generalized linear model regression with those of interval mapping. 
MQM can handle missing data by analyzing probable genotypes. 
MQM selects important marker cofactors by multiple regression 
and backward elimination. QTL are moved along the chromo- 
somes using these preselected markers as cofactors. QTL are inter- 
val mapped using the most informative model through maximum 
likelihood. MQM for R/qtl brings the following advantages to 
QTL mapping: (1) higher power, as long as the QTL explain a 
reasonable amount of variation; (2) protection against overfitting, 
because MQM fixes the residual variance from the full model; 
(3) prevention of ghost QTL detection (between two QTL in 
coupling phase); and (4) detection of negating QTL (QTL in 
repulsion phase) [10]. 

MQM for R/qtl brings additional advantages to systems genet- 
ics data sets with hundreds to millions of traits: (5) a pragmatic 
permutation strategy for control of the FDR and prevention of 
locating false QTL hot spots, as discussed above; (6) high- 
performance computing by scaling on multi-CPU computers, as 
well as clustered computers, by calculating phenotypes in parallel, 
through the message passing interface (MPI) of the parallel package 
for R; and (7) visualizations for exploring interactions in a genomic 


2.6 Matching the 
Prior 


2.7 Combining xQTL 
Results: Causality and 
Network Inference 
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circle plot and cis- and trans-regulation. MQM comes with a 
40-page tutorial for MQM and is part of the software distribution 
of R/qtl [10, 63]. 


After detecting eQTL, we have a map of gene regulation in the form 
of a cis-trans map. When taking a priori information into account, 
i.e., genomic locations derived through other methods, we can 
potentially match the genomic locations of genes and gene families 
with the eQTL cis-trans map. Until now, there has been no com- 
bined QTL and evolutionary study, involving PAML, for host- 
pathogen relationships in plants, though they have been conducted 
separately. 
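
Matching the prior to the eQTL map amounts to an interval overlap query. The following sketch assumes that gene family locations and eQTL confidence intervals are both expressed as physical coordinates on the same reference genome; all names and coordinates are invented for illustration.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] overlap."""
    return a_start <= b_end and b_start <= a_end

def match_prior(gene_families, eqtl_intervals):
    """Report which expression traits have an eQTL overlapping each family.

    gene_families:  dict of family name -> list of (chrom, start, end)
    eqtl_intervals: list of (trait, chrom, start, end) confidence intervals
    """
    matches = {}
    for family, locations in gene_families.items():
        for trait, q_chrom, q_start, q_end in eqtl_intervals:
            if any(chrom == q_chrom and overlaps(start, end, q_start, q_end)
                   for chrom, start, end in locations):
                matches.setdefault(family, []).append(trait)
    return matches

# Invented coordinates (base pairs) for two families under positive selection.
gene_families = {
    "NB-LRR cluster": [("1", 21_000_000, 21_150_000)],
    "chitinase family": [("3", 4_000_000, 4_020_000), ("5", 900_000, 950_000)],
}
eqtl_intervals = [
    ("expr_trait_1", "1", 20_900_000, 21_050_000),
    ("expr_trait_2", "3", 8_000_000, 9_000_000),
    ("expr_trait_3", "5", 800_000, 1_000_000),
]
matches = match_prior(gene_families, eqtl_intervals)
print(matches)
```

Because QTL confidence intervals are wide, an overlap is only a pointer to a candidate region; the prior narrows the region further when the selected gene family occupies a small part of the interval.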


In addition to identifying eQTL or xQTL, it is possible to think in
terms of grouping related traits by correlations. Molecular and 
phenotypic traits can be informative for inferring underlying molec- 
ular networks. When two independent non-correlated traits share 
multiple QTL, inference of a functional relationship is possible 
(Fig. 1b). Thus, distinguishing trait causality, reactivity, or indepen- 
dence can be based upon logic involving underlying QTL. This was 
the basic idea in Jansen and Nap 2001 [64]. Later, people started to 
use biological variation as an extra source for reasoning because if A 
affects B, biological variation in trait A is propagated to B and not 
vice versa. This assumes there is no hidden trait C affecting both A 
and B; see also Li et al. [15]. 

Mapping QTL for thousands of molecular phenotypes is the 
first step in attempting to reconstruct gene networks. Not only can 
network reconstruction be used within a particular layer, say within 
eQTL analysis, i.e., transcript data only, but also across layers. Such 
interlevel (system) analysis integrates transcript eQTL, protein 
pQTL, metabolite mQTL, and classical QTL [13].

The examination of pairwise correlation between traits can lead 
to the hypothesis of a functional relationship when that correlation 
is high. Beyond the detected QTL, the correlation between resi- 
duals among traits, after accounting for QTL effects, or correla- 
tions between traits conditional on other traits is further evidence 
for a network connection. To infer directional effects, it is necessary 
to analyze the correlations among pairs of traits in detail. If trait A 
maps to a subset of the QTL of trait B, then the common QTL can 
be taken as evidence for their network connection, while the dis- 
tinct QTL can be used to infer the direction (Fig. 1b), unless all the 
common QTL have widespread pleiotropic effects, which is when a 
single gene influences multiple traits. If traits A and B have com- 
mon QTL, without QTL that are distinct, then the inference is 
more complicated, and further analysis is needed to discriminate 
pleiotropy from any of the possible orderings among traits [13, 15]. 

Li et al. [15] point out that, despite the exciting possibilities of
correlation analysis, extreme caution is advised, especially in
intralevel analyses, owing to the potential impact of correlated
measurement error (leading to false-positive connections). By
introducing a prior, however, causal inference becomes feasible
for realistic population sizes [15]. The outcome of a causal
inference on two traits sharing a common QTL may be either that one is
causal for the other or that they are independent. In the first case,
QTL-induced variation is propagated from one trait to the other,
while in the latter case, the two traits may even be regulated by
different genes or polymorphisms within the QTL region, and their
apparent relationship (correlation) is explained by linkage
disequilibrium and not by a shared biological pathway [15].


3 Discussion


A QTL is a statistical property connecting genotype with pheno- 
type. In this chapter, we reviewed studies which, with various 
degrees of success, combine some type of prior information with 
xQTL. We propose that a search for genome-wide evidence of 
positive selection can produce a valid and interesting prior for 
xQTL analysis. This is achieved by combining information of geno- 
mic locations of putative gene families, possibly involved in plant- 
pathogen interactions, with QTL locations derived from a systems 
genetics experiment. Both the eQTL example and the search for 
genome-wide evidence of positive selection pressure are essentially 
exploratory and result in a list of putative genes, or gene families, 
with known genomic locations. The combined information yields 
candidate genes and pathways that are under positive selection 
pressure and, potentially, involved in host-pathogen interactions. 
We explain that it is possible to design an eQTL experiment using 
existing experimental populations, e.g., using an A. thaliana RIL 
population, and analyze results with existing free and open-source 
software, such as the R/qtl tool set. 

Systems genetics bridges the study of quantitative traits with 
molecular biology and gives new momentum to QTL population 
studies. Genetic variation at multiple loci in combination with 
environmental factors can induce molecular or phenotypic varia- 
tion. Variation may manifest itself as linear patterns among traits at 
different levels that can be deconstructed. Correlations can be 
attributed to detectable QTL and a logical framework based on 
common and distinct QTL and propagation of biological variation, 
which can be used to infer network causality, reactivity, or indepen- 
dence [15]. Unexplained biological variation can be used to infer 
direction between traits that share a common QTL and have no 
distinct QTL, though it may be difficult to separate biological from 
technical variation. Prior knowledge and complementary experi- 
ments, such as deletion mapping followed by independent gene 
expression studies between parental lines, may validate or disprove
implicated network connections [65].

Evolutionary systems genetics can help dissect the underlying 
genetics of pathogen susceptibility in plants. Where “evolutionary 
genetics” describes how evolutionary forces shape biodiversity, as 
observed in nature, “evolutionary systems genetics” describes how 
phenotype variation in a population is formed by genotype varia- 
tion between, for example, host and pathogen involved in an evo- 
lutionary arms race. 

For the purpose of online analysis, we created GeneNetwork.org
(GN) [7], a free and open-source (FOSS) framework for web-based 
genetics that can be deployed anywhere. GN allows biologists to 
upload high-throughput experimental data, such as expression data 
from microarrays and RNA-seq, and also classical phenotypes, such 
as disease phenotypes. These phenotypes can be mapped interactively
against genotypes using embedded tools, such as R/qtl
[10] for model organisms and FaST-LMM [66] and GEMMA [67] 
which are suitable for human populations and outbred crosses, such 
as the mouse diversity outcross. Interactive D3 graphics are 
included from R/qtl charts, and presentation-ready figures can be 
generated. Recently we have added functionality for phenotype 
correlation [68], correlation trait loci [16], and network analysis 
[14]. For examples on using GeneNetwork, see also Mulligan 
et al. [8]. 

If you want to know more about eQTL, we suggest the review 
by Gilad et al. [23], which also discusses eQTL in genome-wide 
association studies (GWAS), useful in situations where experimen- 
tal crosses are not available (such as with many pathogens and 
humans). For further reading on R-gene evolution, we recommend 
Bakker et al. [34]. For R/qtl analysis, we recommend the R/qtl 
guide [62] and our MQM tutorial online [63]. For integrating 
different xQTL methods and causal inference, we recommend Li 
et al. [15] and Jansen et al. [13]. 


1. What is an eQTL, and why does it present two genomic
locations? 


2. Can a prior, as used here, really add statistical power, or is it no 
more than circumstantial evidence? 


3. When designing an evolutionary systems genetics experiment, 
what are the steps to consider? 


4. How can causality be inferred from QTL networks? 
Semantic Integration and Enrichment of Heterogeneous Biological Databases


Abstract


Biological databases are growing at an exponential rate, currently being among the major producers of Big 
Data, almost on par with commercial generators, such as YouTube or Twitter. While traditionally biological 
databases evolved as independent silos, each purposely built by a different research group in order to answer 
specific research questions, more recently significant efforts have been made toward integrating these
heterogeneous sources into unified data access systems or interoperable systems using the FAIR principles 
of data sharing. Semantic Web technologies have been key enablers in this process, opening the path for new 
insights into the unified data, which were not visible at the level of each independent database. In this 
chapter, we first provide an introduction into two of the most used database models for biological data: 
relational databases and RDF stores. Next, we discuss ontology-based data integration, which serves to 
unify and enrich heterogeneous data sources. We present an extensive timeline of milestones in data 
integration based on Semantic Web technologies in the field of life sciences. Finally, we discuss some of 
the remaining challenges in making ontology-based data access (OBDA) systems easily accessible to a larger 
audience. In particular, we introduce natural language search interfaces, which alleviate the need for 
database users to be familiar with technical query languages. We illustrate the main theoretical concepts 
of data integration through concrete examples, using two well-known biological databases: a gene expres- 
sion database, Bgee, and an orthology database, OMA. 


Key words Data integration, Ontology-based data access, Knowledge representation, Query proces- 
sing, Keyword search, Relational databases, RDF stores 


Abbreviations 

ABox     Assertional box
Bgee     dataBase for Gene Expression Evolution, https://bgee.org/
FK       Foreign key in a relational database
HBB      Hemoglobin subunit beta gene
IRI      Internationalized Resource Identifier
OBDA     Ontology-based data access
OMA      Orthologous Matrix, a database for the inference of orthologs among complete
         genomes, https://omabrowser.org; SPARQL endpoint: https://sparql.omabrowser.org/sparql
PK       Primary key in a relational database
PK-FK    Primary key-foreign key relationship; enables joining two tables in a relational
         database
RDB      Relational database
RDF      Resource Description Framework
SODA     Search Over Relational Databases [21]
SQL      Structured Query Language
SPARQL   SPARQL Protocol and RDF Query Language
TBox     Terminological box
URI      Uniform Resource Identifier


1 Introduction 


Biological databases have grown exponentially in recent decades, 
both in number and in size, owing primarily to modern high-throughput
sequencing techniques [1]. Today, the field of genomics
is almost on par with the major commercial generators of Big
Data, such as YouTube or Twitter, with the total amount of 
genome data doubling approximately every 7 months [2]. While 
most biological databases have initially evolved as independent 
silos, each purposely built by a different research group in order 
to collect data and respond to a specific research question, more 
recently significant efforts have been made toward integrating the 
different data sources, with the aim of enabling more powerful 
insights from the aggregated data, which would not be visible at 
the level of individual databases. 

Let us consider the following example. An evolutionary biolo- 
gist might want to answer the question “What are the human-rat 
orthologs, expressed in the liver, that are associated with leuke- 
mia?”. Getting an answer for this type of question usually requires 
information from at least three different sources: an orthology 
database (e.g., OMA [3], OrthoDB [4], or EggNog [5]); a gene 
expression database, such as Bgee [6]; and a proteomics database 
containing disease associations (e.g., UniProt [7]). In the absence of
unified access to the three data sources, obtaining this information
is a largely manual and time-consuming process. First, the biologist 
needs to know which databases to search through. Second, depend- 
ing on the interface provided by these databases, he or she might 
need to be familiar with a technical query language, such as SQL or 
SPARQL (note: a list of acronyms is provided at the beginning of 
this chapter). At the very least, the biologist is required to know the 
specific identifiers (IDs) and names used by the research group that 
created the database, in order to search for relevant entries. An 
integrated view, however, would allow the user to obtain this
information automatically, without knowing any of the details
regarding the structure of the underlying data sources or the
type of storage these databases use, and possibly without even
knowing specific IDs (such as protein or gene names).

Biological databases are generally characterized by a large
heterogeneity, not only in the type of information they store but also in
the model of the underlying data store they use; examples include
relational databases, file-based stores, and graph-based stores. Examples of
databases considered fundamental to research in the life sciences
can be found in the ELIXIR Core Data Resources, available
online at https://www.elixir-europe.org/platforms/data. In
this chapter we will mainly discuss two types of database models:
the relational model (i.e., relational databases) and a graph-based
data model, RDF (the Resource Description Framework).

Database systems have been around since arguably the same 
time as computers themselves, serving initially as “digitized” copies 
of tabular paper forms, for example, in the financial sector, or for 
managing airline reservations. Relational databases, as well as the 
mathematical formalism underlying them, namely, the relational 
algebra, were formalized in the 1970s by E.F. Codd, in a founda- 
tional paper that now has surpassed 10,000 citations [8]. The 
relational model is designed to structure data into so-called tuples,
according to a predefined schema. Tuples are stored as rows in
tables (also called “relations”). Each table usually defines an entity,
such as an object, a class, or a concept, whose instances (the tuples)
share the same attributes. Examples of relations are “Gene”,
“Protein”, “Species”, etc. The attributes of the relation
represent the columns of the table, for example, “gene name.”
Furthermore, each row has a unique identifier. The column
(or combination of columns) that stores the unique identifier is
called a primary key and can be used not only to uniquely identify
rows within a table but also to connect data between multiple tables,
through a primary key-foreign key relationship. Making such a
connection is called a join. In fact, a join is only one of the
operations defined by relational algebra. Other common operations
include projection and selection. The operands of relational
algebra are the database tables, as well as their attributes, while the 
operations are expressed through the Structured Query Language 
(SQL). For a more in-depth discussion on relational algebra, we 
refer the reader to the original paper by E.F. Codd [8]. 
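
To make the relational operations above concrete, the following minimal Python sketch implements selection, projection, and join over tables represented as lists of dictionaries. The sample rows, the helper names (select, project, join), and the use of NCBI taxonomy numbers as SpeciesID values are assumptions made for illustration only; a real database engine implements these operations far more efficiently.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Relation = List[Row]


def select(relation: Relation, predicate: Callable[[Row], bool]) -> Relation:
    """Selection: keep only the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation: Relation, attributes: List[str]) -> Relation:
    """Projection: keep only the named columns and drop duplicate rows."""
    result: Relation = []
    for row in relation:
        reduced = {name: row[name] for name in attributes}
        if reduced not in result:
            result.append(reduced)
    return result


def join(left: Relation, right: Relation, key: str) -> Relation:
    """Join: combine rows of both tables that agree on the key column."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# Hypothetical sample data; SpeciesID is the primary key of `species`
# and a foreign key in `gene`.
species = [
    {"SpeciesID": 9606, "SpeciesCommonName": "human"},
    {"SpeciesID": 10090, "SpeciesCommonName": "mouse"},
]
gene = [
    {"GeneName": "HBB", "SpeciesID": 9606},
    {"GeneName": "Hbb-bt", "SpeciesID": 10090},
]

# Join the two tables, select the rows for HBB, project the species name.
hbb_rows = select(join(species, gene, "SpeciesID"),
                  lambda row: row["GeneName"] == "HBB")
result = project(hbb_rows, ["SpeciesCommonName"])
print(result)
```

In SQL, the same three steps are written declaratively in a single statement, and the database system decides in which order to evaluate them.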

This chapter is structured as follows. In Sect. 2, we give a brief 
introduction to relational databases, through the concrete example 
of the Bgee gene expression database. We introduce the basics of 
Semantic Web technologies in Sect. 3. Readers who are already 
familiar with the Semantic Web stack might skip Sect. 3 and jump 
directly to Sect. 4, which presents an applied use case of Semantic 
Web technologies in the life sciences: modeling the Bgee and OMA 
databases. Section 5 represents the core of this chapter. Here, we
present ontology-based data integration (Sect. 5.1) and illustrate it 
through the concrete example of a unified ontology for Bgee and 
OMA (Sect. 5.2), as well as the mechanisms required to further 
extend the integrated system with other heterogeneous sources 
such as the UniProt protein knowledge base (Sect. 5.3). We intro- 
duce natural language interfaces, which enable easy data access even 
for nontechnical users, in Sect. 5.4. We present an extensive time- 
line of milestones in data integration based on Semantic Web 
technologies in the field of life sciences in Sect. 6. Finally, we 
conclude in Sect. 7. 


2 Modeling a Biological Database with Relational Database Technology 


In this section we will demonstrate how to model a biological
database with relational database technology.

GlobalCond: GlobalConditionID, AnatEntityID, SpeciesID, StageID, sex,
    strain, affymetrixMaxRank, rnaSeqMaxRank, estMaxRank, inSituMaxRank
Species: SpeciesID, genus, Species, SpeciesCommonName,
    SpeciesDisplayOrder, taxonID, genomeFilePath, genomeVersion,
    dataSourceID, genomeSpeciesID
Gene: bgeeGeneID, GeneID, GeneName, GeneDescription, SpeciesID,
    GeneBioTypeID, OMAParentNodeID, ensemblGene,
    GeneMappedToGeneIDCount
Stage: StageID, StageName, StageDescription, StageLeftBound,
    StageRightBound, StageLevel, groupingStage
AnatEntity: AnatEntityID, AnatEntityName, AnatEntityDescription,
    startStageID, endStageID

Fig. 1 Sample relational database (extracted from the gene expression database Bgee)

Figure 1 illustrates the data model of a sample extracted from
the Bgee database. The sample contains five tables and their
relationships, shown as arrows, where the direction of the arrow is
oriented from the foreign key of one table to the primary key of a
related one. For example, the Primary Key (PK) of the Species table
is the SpeciesID. Following the relationships highlighted in bold, we
see that the SpeciesID also appears in the two tables connected to
Species: GlobalCond and Gene. In these tables, the attribute plays the
role of a Foreign Key (FK). The PK-FK relationships allow
combining or aggregating data from related tables. For example, by
joining Species and Gene, through the SpeciesID, we can find to
which species a gene belongs. Concretely, let’s assume we want to
find the species where the gene “HBB” can be found. Given that
this information is stored in the SpeciesCommonName attribute, we
can retrieve it through the following SQL query:


SELECT SpeciesCommonName
FROM Species JOIN Gene ON Species.SpeciesID = Gene.SpeciesID
WHERE Gene.GeneName = 'HBB'


This query retrieves (via the “SELECT” keyword) the
attribute corresponding to the species name (SpeciesCommonName)
by joining the Species and Gene tables, based on their
primary key-foreign key relationship, namely, via the SpeciesID, on
the condition that the GeneName exactly matches “HBB.” For a
more detailed introduction to the syntax and usage of SQL, we
refer the reader to an online introductory tutorial [9], as well as the
more comprehensive textbooks [10, 11].

Taking this a step further, we can imagine the case where a
second relational database also stores information about genes, but
perhaps with some additional data, such as associations with
diseases. Can we still combine information across these distinct
databases? Indeed, as long as there is a common point between the
tables in the two databases, such as the GeneID or the SpeciesID, it
is usually possible to combine them into a single, federated database
and use SQL to query it through federated joins. An example of
using federated databases for biomedical data is presented in [12].


2.1 Limitations of Relational Databases and Emerging Solutions for Data Integration


So far, we have seen that relational databases are a mature, highly 
optimized technology for storing and querying structured data. 
Also, combined with a powerful and expressive query language, 
SQL, they allow users to federate (join) data even from different 
databases. 
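
As a hedged illustration of both points, the sketch below runs a join between Species and Gene and then a join that spans two separate databases. It uses Python's built-in sqlite3 module, with SQLite's ATTACH DATABASE standing in for a federated setup; production federation typically relies on other mechanisms. The reduced table layouts, the gene identifiers (gene-1, gene-2), the GeneDisease table, and the sample rows are all assumptions made for this example and are not taken from Bgee.

```python
import sqlite3

# One connection with a second, separate in-memory database attached to it.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS disease_db")

conn.executescript("""
    CREATE TABLE main.Species (
        SpeciesID INTEGER PRIMARY KEY,
        SpeciesCommonName TEXT
    );
    CREATE TABLE main.Gene (
        GeneID TEXT PRIMARY KEY,
        GeneName TEXT,
        SpeciesID INTEGER REFERENCES Species(SpeciesID)
    );
    -- Hypothetical second database holding gene-disease associations.
    CREATE TABLE disease_db.GeneDisease (GeneID TEXT, Disease TEXT);

    INSERT INTO main.Species VALUES (9606, 'human'), (10090, 'mouse');
    INSERT INTO main.Gene VALUES ('gene-1', 'HBB', 9606),
                                 ('gene-2', 'Hbb-bt', 10090);
    INSERT INTO disease_db.GeneDisease VALUES
        ('gene-1', 'beta-thalassemia'),
        ('gene-1', 'sickle cell anemia');
""")

# Join within one database: in which species is the gene HBB found?
species_of_hbb = conn.execute(
    """
    SELECT Species.SpeciesCommonName
    FROM Species JOIN Gene ON Species.SpeciesID = Gene.SpeciesID
    WHERE Gene.GeneName = ?
    """,
    ("HBB",),
).fetchall()

# Join across both databases, using GeneID as the common point.
federated = conn.execute(
    """
    SELECT g.GeneName, s.SpeciesCommonName, d.Disease
    FROM main.Gene AS g
    JOIN main.Species AS s ON s.SpeciesID = g.SpeciesID
    JOIN disease_db.GeneDisease AS d ON d.GeneID = g.GeneID
    ORDER BY d.Disease
    """
).fetchall()

print(species_of_hbb)
print(federated)
```

The second query only works because both databases use the same GeneID values; without such a shared identifier, the join has no common point to match on.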

However, there are certain relationships that are not natural for 
relational databases. Let us consider the relationship “hasOrtho- 
log”. Both the domain and the range of this relationship, as defined 
in the Orthology Ontology [13], are the same: a gene. For example,
the hemoglobin (HBB) gene in human has the Hbb-bt orthologous
gene in the mouse (expressed via the relation hasOrtholog).
In the relational database world, this translates into a so-called self- 
join. As the name suggests, this requires joining one table—in this 
case, Gene—with itself, in order to retrieve the answer. These types 
of “self-join” relations, while frequent in the real world (e.g., a 
manager of an employee is also an employee, a friend of a person 
is also a person, etc.), are inefficient in the context of relational 
databases. While there are sometimes ways to avoid self-joins, these 
require even more advanced SQL fluency on the part of the
programmer [14]. 
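
A self-join of this kind can be sketched as follows, again with Python's sqlite3 module. The schema is hypothetical: we assume a reduced Gene table and a link table, Ortholog, that stores pairs of gene identifiers. Answering the question "which gene is the ortholog of HBB?" then requires the Gene table to appear twice in the same query, under two different aliases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Gene (
        GeneID INTEGER PRIMARY KEY,
        GeneName TEXT,
        Species TEXT
    );
    -- Hypothetical link table: both columns point back to Gene.
    CREATE TABLE Ortholog (
        GeneID1 INTEGER REFERENCES Gene(GeneID),
        GeneID2 INTEGER REFERENCES Gene(GeneID)
    );
    INSERT INTO Gene VALUES (1, 'HBB', 'human'), (2, 'Hbb-bt', 'mouse');
    INSERT INTO Ortholog VALUES (1, 2);
""")

# Gene is joined with itself: g1 is the query gene, g2 its ortholog.
orthologs = conn.execute(
    """
    SELECT g1.GeneName, g1.Species, g2.GeneName, g2.Species
    FROM Ortholog AS o
    JOIN Gene AS g1 ON g1.GeneID = o.GeneID1
    JOIN Gene AS g2 ON g2.GeneID = o.GeneID2
    WHERE g1.GeneName = 'HBB'
    """
).fetchall()

print(orthologs)
```

Each additional hop along the relation (for example, orthologs of orthologs) adds another copy of the table to the query, which is one reason such relations are more naturally expressed in a graph model.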

Moreover, relational databases are typically not well-suited for 
applications that require frequent schema changes. Hence, NoSQL 
stores have gained widespread popularity as an alternative to
traditional relational database management systems [15-17]. These
systems do not impose a strict schema on the data and are therefore
more flexible than relational databases in the cases where the struc- 
ture of the data is likely to change over time. In particular, graph 
databases, such as Virtuoso [18], are very well suited for data 
integration, as they allow easily combining multiple data sources 
into a single graph. We discuss this in more detail in Sect. 3. 

These and other considerations have led to the vision of the
Semantic Web, formalized in 2001 by Tim Berners-Lee et al.
[19]. At a high level, the Semantic Web allows representing the
semantics of data in a structured, easy to interlink, machine-readable
way, typically by use of the Resource Description Framework
(RDF), a graph-based data model. The gradual adoption of
RDF stores, although widespread in the Web context and in the life
sciences in particular, did not replace relational databases altogether,
which led to a new challenge: how will these heterogeneous
data sources now be integrated?

Initial integration approaches in the field of biological data- 
bases have been largely manual: first, many of them (either rela- 
tional or graph-based) have included cross-references to other 
sources. For example, UniProt contains links to more than 
160 other databases. However, this raises a question for the user: 
which of the provided links should be followed in order to find 
relevant connections? While a user can be assumed to know the 
contents of a few related databases, we can hardly expect anyone to 
be familiar with more than 160 of them! To avoid this problem, 
other databases have chosen an orthogonal approach: instead of 
referencing links to other sources, simply copy the relevant data 
from those sources into the database. This approach also has a few 
drawbacks. First, it generates redundant data (which might result in 
significant storage space consumption), and, most importantly, it 
might lead to the use of stale, outdated results. Moreover, this 
approach is contradictory to best practices of data warehousing 
used widely across various domains in industry. For a discussion 
on this, we refer the reader to [20]. 

Databases such as UniProt are highly comprehensive, with new 
results being added to each release, results that may sometimes even 
contradict previous results. Duplication of this data into another 
database can quickly lead to missing out the most recent informa- 
tion or to high maintenance efforts required to keep up with the 
new changes. In the following sections, we discuss an alternative 
approach: integrating heterogeneous data sources through the use 
of a unifying data integration layer, namely, an integrative ontology, 
that aligns, but also enriches the existing data, with the purpose of
facilitating knowledge discovery. 

Throughout the remainder of this chapter, we will combine 
theoretical aspects of data integration with concrete examples, 
based on our SODA project [21], as well as from our ongoing 
research project, Bio-SODA [22], where we are currently building 
an integrated data access system for biological databases (starting 
with OMA and Bgee), using a natural language search interface. In 
the context of this project, Semantic Web technologies, such as 
RDF, are used to enhance interoperability among heterogeneous 
databases at the semantic level (e.g., RDF graphs with predefined 
semantics). Moreover, currently, several life science and biomedical 
databases such as OMA [3], UniProt [7], neXtProt [23], the
European Bioinformatics Institute (EMBL-EBI) RDF data [24], 
and the WorldWide Protein Data Bank [25] already provide RDF 
data access, which also justifies an RDF-based approach to enable 
further integration efforts to include these databases. A recent 
initiative for (biological) data sharing is based on the FAIR principles
[26], aiming to make data findable, accessible, interoperable,
and reusable.


3 Semantic Web Technologies 


The Semantic Web, as its name suggests, emerged mainly as a means
to attach semantics (meaning) to data on the Web [19]. In contrast
to relational databases, Semantic Web technologies rely on a graph
data model, in order to enable interlinking data from disparate
sources available on the Web. Although the vision of the Semantic
Web still remains an ideal, many large datasets are currently
published based on the Linked Data principles [27] using Semantic
Web technologies (e.g., RDF). The Linked Open Data Cloud
illustrates a collection of a large number of different resources
including DBpedia, UniProt, and many others.

In this section, we will describe the Semantic Web (SW) stack,
focusing on the technologies that enhance data integration and
enrichment. For a more complete description of the SW stack, we
refer the reader to the comprehensive introductions in [28-30].

The Semantic Web stack is presented in Fig. 2. We will focus on
the following standards or layers of the stack: URI, the syntax layer
(e.g., Turtle (TTL), an RDF serialization format), RDF, OWL,
RDFS, and SPARQL. These layers are highlighted in gray in Fig. 2.


3.1 Uniform Resource Identifier (URI)


A Uniform Resource Identifier (URI) is a character sequence that
identifies an abstract or physical resource. A URI is classified as a
locator, a name, or both. The Uniform Resource Locators (URLs)
are a subset of URIs that, in addition to identifying a resource,
provide a means of locating the resource by describing its primary
access or network “location.” For example, https://bgee.org is a
URI that identifies a resource (i.e., the Bgee gene expression
website), and it implies solely a representation of this resource (i.e., an
HTML Web page). This resource is accessible through the HTTPS
protocol.

Fig. 2 The Semantic Web stack modified from [31]

http://purl.uniprot.org/proteomes/UP 44#Genom

Fig. 3 An example of a UniProt URI with a fragment

The Uniform Resource Name (URN) is also a URI that refers 
to both the “urn” scheme [32], which are URIs required to remain 
globally unique and persistent even when the resource does not 
exist anymore or becomes unavailable, and to any other URI with 
the properties of a name. For example, the URN urn:isbn:978-1- 
61779-581-7 is a URI that refers to a previous edition of this book 
by using the International Standard Book Number (ISBN). How- 
ever, no information about the location and how to get this 
resource (book) is provided. 

The URI syntax consists of a hierarchical sequence of compo- 
nents referred to as the scheme, authority, path, query, and frag- 
ment [33]. Figure 3 describes a UniProt URI that includes these 
components. 
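
The components listed above can be inspected programmatically. The following minimal sketch uses urlsplit from Python's standard urllib.parse module on three URIs that appear in this chapter; the helper name uri_components and the choice of dictionary keys are our own, and the "authority" entry simply exposes what the library calls netloc.

```python
from typing import Dict
from urllib.parse import urlsplit


def uri_components(uri: str) -> Dict[str, str]:
    """Split a URI into scheme, authority, path, query, and fragment."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


# A locator with a query component.
search = uri_components("http://www.ensembl.org/Multi/Search/Results?q=BRCA2")
# A URI with a fragment identifier.
protein = uri_components("http://omabrowser.org/ontology/oma#PROTEIN_HUMAN04027")
# A name (URN): it has a scheme and a path, but no authority to contact.
isbn = uri_components("urn:isbn:978-1-61779-581-7")

for parsed in (search, protein, isbn):
    print(parsed)
```

Components that are absent from a given URI are returned as empty strings, which makes it easy to see that the URN carries no network location.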

An individual scheme does not have to be classified as being just
one of “name” or “locator.” Instances of URIs from any given
scheme may have the characteristics of names (URN) or locators
(URL) or both (URN + URL). Further examples of URIs with
variations in their syntax components are:

• ftp://ftp.bgee.org/current/download/calls/expr_calls/Sus_scrofa_expr_simple_development.tsv.zip
• http://www.ensembl.org/Multi/Search/Results?q=BRCA2
• mailto:Bgee@sib.swiss
• urn:miriam:pubmed:26615188
• https://www.ncbi.nlm.nih.gov/pubmed/26615188


3.2 Resource Description Framework (RDF)


The Resource Description Framework (RDF) is a framework for
describing information about resources in the World Wide Web,
which are identified with URIs. In the previous section, we have
seen that data in relational databases is organized into tables,
according to some predefined schema. In contrast, in RDF stores,
data is mainly organized into triples, namely, <subject, predicate,
object>, similarly to how sentences in natural language are
structured. An informal example would be: <Bob, isFriendOf,
Alice>. A primer on triples and the RDF data model, using this
simple example, is available online [34]. Figure 4 illustrates the
RDF triple: the subject represents the resource being described,
the predicate is a property of that resource, and finally the object is
the value of the property (i.e., an attribute of the subject).

(Subject) --Predicate--> (Object)

Fig. 4 An RDF graph with two nodes (subject and object) and an edge connecting
them (predicate)
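
The triple structure can be mimicked in a few lines of Python. The sketch below is only a toy: it stores triples as plain tuples of strings and answers simple pattern queries, where None acts as a wildcard. The helper name match, the extra individual Carol, and the predicate worksWith are hypothetical additions to the informal Bob and Alice example; a real triple store identifies resources with URIs and is queried with SPARQL.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return all triples that agree with the given pattern.

    A position left as None matches any value.
    """
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# A small graph: each triple is one edge from a subject to an object.
graph: List[Triple] = [
    ("Bob", "isFriendOf", "Alice"),
    ("Alice", "isFriendOf", "Carol"),
    ("Bob", "worksWith", "Carol"),
]

friends_of_bob = match(graph, subject="Bob", predicate="isFriendOf")
about_carol = match(graph, obj="Carol")

print(friends_of_bob)
print(about_carol)
```

Because every statement has the same three-part shape, adding a new kind of relationship requires no schema change: it is simply one more triple with a new predicate.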

Triples can be defined using the RDF. The data store for RDF 
data is also called a “triple store.” Moreover, in analogy to the data 
model (or the schema) of a relational database, the high-level 
structure of data in a triple store can be described using an ontology. 
According to Studer et al. [35], an ontology is a formal, explicit 
specification of a shared conceptualization. “Formal” refers to the 
fact that the expressions must be machine readable: hence, natural 
language is excluded. In this context, we can mention description 
logic (DL)-based languages [36], such as OWL 2 DL (see Sect. 3.3 
for further details) to define ontologies. A DL ontology is the 
equivalent of a knowledge base (KB). A KB is mainly composed 
of two components that describe different statements in ontolo- 
gies: the terminological box (TBox, i.e., the schema) and the 
assertional box (ABox, i.e., the data). Therefore, the conceptual 
statements form the set of TBox axioms, whereas the instance level 
statements form the set of ABox assertions. To exemplify this, we
can mention the following DL axioms: Man ≡ Human ⊓ Male
(a TBox axiom that states a man is a human and male) and john:
Man (an ABox assertion that states john is an instance of man).
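
To illustrate how TBox axioms and ABox assertions interact, the following toy sketch derives the statements that follow from the example above. It handles only one kind of axiom, a named concept defined as the conjunction of other named concepts, and it is in no way a substitute for a description logic reasoner. The function name entailed_types, the data layout, and the second individual, paul, are assumptions made for this illustration.

```python
from typing import Dict, Set, Tuple

Assertion = Tuple[str, str]  # (individual, concept)


def entailed_types(tbox: Dict[str, Set[str]],
                   abox: Set[Assertion]) -> Set[Assertion]:
    """Apply each definition 'C is equivalent to D1 and D2 and ...'
    in both directions until no new assertion can be derived."""
    inferred = set(abox)
    changed = True
    while changed:
        changed = False
        for concept, conjuncts in tbox.items():
            for individual in {i for i, _ in inferred}:
                types = {c for i, c in inferred if i == individual}
                new: Set[Assertion] = set()
                if concept in types:
                    # An instance of C is an instance of every conjunct.
                    new |= {(individual, c) for c in conjuncts}
                if conjuncts <= types:
                    # An instance of all conjuncts is an instance of C.
                    new.add((individual, concept))
                if not new <= inferred:
                    inferred |= new
                    changed = True
    return inferred


# TBox (schema): Man is equivalent to Human and Male.
tbox = {"Man": {"Human", "Male"}}

# ABox (data): john is asserted to be a Man.
closure = entailed_types(tbox, {("john", "Man")})

# The definition also works in the other direction.
closure_paul = entailed_types(tbox, {("paul", "Human"), ("paul", "Male")})

print(sorted(closure))
print(sorted(closure_paul))
```

The point of the separation is that the axiom is stated once, at the schema level, while the derived facts hold for every individual in the data.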

Given that one of the goals of the Semantic Web is to assign 
unambiguous names to resources (URIs), an ontology should be 
more than a simple description of data in a particular triple store. 
Rather, it should more generally serve as a description of a domain, 
for instance, genomics (see Gene Ontology [37]) or orthology (see
Orth Ontology [13]). Different instantiations of this domain, for 
example, by different research groups, should reuse and extend this 
ontology. Therefore, constructing good ontologies requires careful 
consideration and agreement between domain specialists, with the 
goal of formally representing knowledge in their field. As a conse- 
quence, ontologies are usually defined in the scope of consor- 
tiums—such as the Gene Ontology Consortium [38] or the 
Quest for Orthologs Consortium [39]. A notable collaborative 
effort is the Open Biological and Biomedical Ontology (OBO) 
Foundry [40]. It established principles for ontology development 
and evolution, with the aim of maximizing cross-ontology coordi- 
nation and interoperability, and provides a repository of life science 
ontologies, currently including about 140 ontologies.

To give an example of RDF data in a concrete life sciences use 
case, let us consider the following RDF triples, which illustrate a 
few of the assertions used in the OMA orthology database to 
describe the human hemoglobin protein (“HBB”), using the first 
version of the ORTH ontology [13]: 


oma:PROTEIN_HUMAN04027 rdf:type orth:Protein .
oma:PROTEIN_HUMAN04027 oma:geneName "HBB" .
oma:PROTEIN_HUMAN04027 biositemap:description "Hemoglobin subunit beta" .
oma:PROTEIN_HUMAN04027 obo:RO_0002162 <http://www.uniprot.org/taxonomy/9606> .


This simple example already illustrates most of the basics of 
RDF. The instance that is being defined, the HBB protein in
human, has the following URI in the OMA RDF store:
http://omabrowser.org/ontology/oma#PROTEIN_HUMAN04027

The URI is composed of the OMA prefix,
http://omabrowser.org/ontology/oma# (abbreviated here as “oma:”),
and a fragment identifier, PROTEIN_HUMAN04027. The first triple
describes the type of this resource, namely an orth:Protein, based
on the Orthology Ontology, prefixed here as “orth:”
(http://purl.org/net/orth#). As mentioned previously, this is a higher-level ontology,
which OMA reuses and instantiates. It is important to note that 
other ontologies are used as well in the remaining assertions: for 
example, the last triple references the UniProt taxonomy ID 9606. 
This is based on the National Center for Biotechnology Information
(NCBI) organismal taxonomy [41]. If we follow the link in a
Web browser, we see that it identifies the “Homo sapiens” species,
while the property obo:RO_0002162 (i.e.,
http://purl.obolibrary.org/obo/RO_0002162) simply denotes “in taxon”
in OBO [40]. Lastly, the concept also has a human-readable
description, “Hemoglobin subunit beta.”


3.3 RDF Schema (RDFS)


RDF Schema (RDFS) provides a vocabulary for modeling RDF
data and is a semantic extension of RDF. It provides mechanisms
for describing groups (i.e., classes) of related resources and the
relationships between these resources. RDFS is itself defined in
RDF. RDFS terms are used to define attributes of other
resources, such as the domains (rdfs:domain) and ranges
(rdfs:range) of properties. The RDFS core vocabulary is
defined in a namespace that is conventionally associated with the
prefix rdfs: and is identified
by the URI http://www.w3.org/2000/01/rdf-schema#.

In this section, we will mostly focus on the RDF and RDFS 
terms used in this chapter. Further information about RDF/RDFS 
terms is available in [42]. 


• Classes

  - rdfs:Resource: all things described by RDF are called
    resources, which are instances of the class rdfs:Resource (i.e.,
    rdfs:Resource is an instance of rdfs:Class).

  - rdfs:Class is the class of resources that are RDF classes.
    Resources that have properties (attributes) in common may
    be divided into classes. The members of a class are instances.

  - rdf:Property is a relation between subject and object
    resources, i.e., a predicate. It is the class of RDF properties.

  - rdfs:Literal is the class of literal values such as textual strings
    and integers. rdfs:Literal is a subclass of rdfs:Resource.

• Properties

  - rdfs:range is an instance of rdf:Property. It is used to state
    that the values of a property are instances of one or more
    classes. For example, orth:hasHomolog rdfs:range
    orth:SequenceUnit (see Fig. 5a). This statement means that the
    values of the orth:hasHomolog property can only be instances of
    the orth:SequenceUnit class.

  - rdfs:domain is an instance of rdf:Property. It is used to state
    that any resource that has a given property is an instance of
    one or more classes. For example, orth:hasHomolog rdfs:domain
    orth:SequenceUnit (see Fig. 5b). This statement means that
    resources that assert the orth:hasHomolog property must be
    instances of the orth:SequenceUnit class.

  - rdf:type is an rdf:Property that is used to state that a resource
    is an instance of a class.

  - rdfs:subClassOf is an rdf:Property used to assert that all
    instances of one class are instances of another. For example, if
    C1 rdfs:subClassOf C2, then an instance of C1 is also an instance
    of C2, but not vice versa.

  - rdfs:subPropertyOf is used to state that all resources related
    by one property (i.e., the subject of rdfs:subPropertyOf) are
    also related by another (i.e., the object of rdfs:subPropertyOf,
    the “super-property”). For example, all orthologous
    relations are also homologous relations. Because of this, in
    the latest release candidate of the Orthology Ontology [13],
    it is stated that orth:hasOrtholog is a sub-property of
    orth:hasHomolog. Figure 5c illustrates this statement.

Fig. 5 Examples of RDF/RDFS statements
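
The effect of these RDFS terms can be sketched with a small forward-chaining procedure in Python. This is a minimal illustration and not a complete RDFS reasoner: it applies only the four entailment rules for rdfs:domain, rdfs:range, rdfs:subPropertyOf, and rdfs:subClassOf. The individuals ex:geneA, ex:geneB, and ex:geneC are hypothetical, and the statement that orth:Gene is a subclass of orth:SequenceUnit is assumed here for illustration.

```python
# Schema-level statements (prefixed names are kept as plain strings).
SCHEMA = {
    ("orth:hasOrtholog", "rdfs:subPropertyOf", "orth:hasHomolog"),
    ("orth:hasHomolog", "rdfs:domain", "orth:SequenceUnit"),
    ("orth:hasHomolog", "rdfs:range", "orth:SequenceUnit"),
    # Assumed for illustration only.
    ("orth:Gene", "rdfs:subClassOf", "orth:SequenceUnit"),
}

# Instance-level statements about hypothetical individuals.
DATA = {
    ("ex:geneA", "orth:hasOrtholog", "ex:geneB"),
    ("ex:geneC", "rdf:type", "orth:Gene"),
}


def rdfs_closure(triples):
    """Apply four RDFS entailment rules until no new triple is produced."""
    graph = set(triples)
    while True:
        new = set()
        for s, p, o in graph:
            for s2, p2, o2 in graph:
                # rdfs7: s p o, p subPropertyOf q  =>  s q o
                if p2 == "rdfs:subPropertyOf" and s2 == p:
                    new.add((s, o2, o))
                # rdfs2: s p o, p domain C  =>  s rdf:type C
                if p2 == "rdfs:domain" and s2 == p:
                    new.add((s, "rdf:type", o2))
                # rdfs3: s p o, p range C  =>  o rdf:type C
                if p2 == "rdfs:range" and s2 == p:
                    new.add((o, "rdf:type", o2))
                # rdfs9: s rdf:type C, C subClassOf D  =>  s rdf:type D
                if p == "rdf:type" and p2 == "rdfs:subClassOf" and s2 == o:
                    new.add((s, "rdf:type", o2))
        if new <= graph:
            return graph
        graph |= new


closure = rdfs_closure(SCHEMA | DATA)
for triple in sorted(closure - SCHEMA - DATA):
    print(triple)
```

Starting from a single orth:hasOrtholog assertion, the sketch derives the corresponding orth:hasHomolog statement and types both genes as orth:SequenceUnit, which is the kind of implicit knowledge a schema adds to the raw data.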


3.4 Web Ontology Language (OWL)


The first level above RDF/RDFS in the Semantic Web stack (see
Fig. 2) is an ontology language that can formally describe the
meaning of resources. If machines are expected to perform useful
reasoning tasks on RDF data, the language must go beyond the
basic semantics of RDF Schema [43]. Because of this, OWL and
OWL 2 (i.e., Web Ontology languages) include more terms for
describing properties and classes, such as relations between classes
(e.g., disjointness, owl:disjointWith), cardinality (e.g., “exactly 2,”
owl:cardinality), equality (i.e., owl:equivalentClass), richer typing
of properties, characteristics of properties (e.g., symmetry,
owl:SymmetricProperty), and enumerated classes (i.e., owl:oneOf). The
owl: prefix replaces the following URI namespace:
http://www.w3.org/2002/07/owl#.
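
As a small illustration of what such a property characteristic adds, the following Python sketch derives the statements entailed by symmetry. It assumes that orth:hasHomolog is declared an owl:SymmetricProperty; the individuals and the ex:encodes property are hypothetical, and no other OWL semantics are covered.

```python
# Properties assumed to be declared as owl:SymmetricProperty.
SYMMETRIC = {"orth:hasHomolog"}


def symmetric_closure(triples, symmetric_properties):
    """Add the reversed triple for every statement that uses a symmetric property."""
    graph = set(triples)
    graph |= {(o, p, s) for s, p, o in triples if p in symmetric_properties}
    return graph


# Hypothetical assertions: only the first one uses a symmetric property.
asserted = {
    ("ex:geneA", "orth:hasHomolog", "ex:geneB"),
    ("ex:geneA", "ex:encodes", "ex:proteinA"),
}
closed = symmetric_closure(asserted, SYMMETRIC)
print(sorted(closed))
```

Plain RDFS has no way of expressing this: without the symmetry declaration, a query for the homologs of ex:geneB would return nothing.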

As a full description of OWL and OWL 2 is beyond the scope of 
this chapter, we refer the interested reader to [44, 45]. In the 
following, we focus solely on some essential modeling features 
that the OWL languages offer in addition to the RDF/RDFS
vocabularies.


• owl:Class is a subclass of rdfs:Class. Like rdfs:Class, an owl:Class
  groups instances that share common properties. However, this
  new OWL term is defined due to the restrictions on DL-based
  OWL languages (e.g., OWL DL and OWL Lite; OWL 2 DL and
  its syntactic fragments EL, QL, and RL). These restrictions
  imply that not all RDFS classes are legal OWL DL/OWL
  2 DL classes. For example, the orth:SequenceUnit entity in the
  ORTH ontology is stated as an OWL class (i.e.,
  orth:SequenceUnit rdf:type owl:Class; Fig. 5d illustrates this
  axiom). Therefore, orth:SequenceUnit is also an RDFS class, since
  owl:Class is a subclass of rdfs:Class.

• owl:ObjectProperty is a subclass of rdf:Property. The instances
  of owl:ObjectProperty are object properties that link individuals to
  individuals (i.e., members of an owl:Class). For example, the
  orth:hasHomolog object property (see Fig. 5e) relates one
  orth:SequenceUnit individual to another one. Figure 5a illustrates
  this example.

• owl:DatatypeProperty is a subclass of rdf:Property. The
  instances of owl:DatatypeProperty are datatype properties that
  link individuals to data values. To illustrate a datatype property,
  we can mention oma:ensemblGeneId (see Figs. 5f and 6b).
  This property assigns a gene identifier to an instance of an
  orth:Gene.

Fig. 6 Examples of instances of orth:SequenceUnit and orth:Gene and object and datatype property assertions

Further information about the OWL languages is available as
World Wide Web Consortium (W3C) recommendations in [46]
and [47].


3.5 RDF Serialization Formats


RDF is a graph-based data model which provides a grammar for its 
syntax. Using this grammar, RDF syntax can be written in various 
concrete formats which are called RDF serialization formats. For 
example, we can mention the following formats: Turtle [48], 
RDF/XML (an XML syntax for RDF) [49], and JSON-LD 
(a JSON syntax for RDF) [50]. In this section, we will solely 
focus on the Turtle format. 

The Turtle language (TTL) allows for writing an RDF graph in a
compact textual form. To exemplify this serialization format, let us
consider the following Turtle document that defines the homologous
and orthologous relations:


@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix orth: <http://purl.org/net/orth#> .

# http://purl.org/net/orth#SequenceUnit
orth:SequenceUnit rdf:type owl:Class .

orth:hasHomolog rdf:type owl:ObjectProperty ;
    rdf:type owl:SymmetricProperty ;
    rdfs:domain orth:SequenceUnit ;
    rdfs:range orth:SequenceUnit .

orth:hasOrtholog rdf:type owl:ObjectProperty ;
    rdfs:subPropertyOf orth:hasHomolog .


This example introduces many of the features of the Turtle
language: @prefix and prefixed names (e.g., @prefix rdfs:
<http://www.w3.org/2000/01/rdf-schema#> .), predicate lists
separated by “;” (e.g., orth:hasOrtholog rdf:type
owl:ObjectProperty ; rdfs:subPropertyOf orth:hasHomolog .),
comments prefixed with “#” (e.g., #
http://purl.org/net/orth#SequenceUnit), and a simple triple where
the subject, predicate, and object are separated by white space and
ended with a “.” (e.g., orth:SequenceUnit rdf:type owl:Class .).

Further details about TTL serialization are available as a W3C
recommendation in [48].


3.6 Querying the Semantic Web with SPARQL

Once we have defined the knowledge base (TBox and ABox), how 
can we use it to retrieve relevant data? Similar to SQL for relational 
databases, data in RDF stores can be accessed by using a query 
language. One of the main RDF query languages, especially used 
in the field of life sciences, is SPARQL [51]. A SPARQL query 
essentially consists of a graph pattern, namely, conjunctive RDF 
triples, where the values that should be retrieved (the unknowns— 
either subjects, predicates, or objects) are replaced by variable names, 
prefixed by “?”. Looking again at the previous example, if we want to 
get the description of the “HBB” protein from OMA, we would 
simply use a graph pattern, where the value of the “description”— 
the one we want to retrieve—is replaced by a variable as follows: 


SELECT ?description WHERE {
    ?protein oma:geneName "HBB" .
    ?protein biositemap:description ?description .
}


The choice of variable name itself is not important (we could
have used “?x”, “?var”, etc., albeit with a loss of readability).
Essentially, we are interested in the description of a protein about
which we only know a name: “HBB.”
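
To make the notion of a graph pattern more tangible, the following Python sketch evaluates the two triple patterns of this query against a tiny in-memory list of triples. It is a toy matcher written for illustration, not a SPARQL engine: the functions match and select are names introduced here, literals are simplified to plain strings, and the second protein (ex:otherProtein) is hypothetical.

```python
TRIPLES = [
    ("oma:PROTEIN_HUMAN04027", "rdf:type", "orth:Protein"),
    ("oma:PROTEIN_HUMAN04027", "oma:geneName", "HBB"),
    ("oma:PROTEIN_HUMAN04027", "biositemap:description", "Hemoglobin subunit beta"),
    # Hypothetical second entry, present only to show that it is filtered out.
    ("ex:otherProtein", "oma:geneName", "XYZ"),
    ("ex:otherProtein", "biositemap:description", "Some other protein"),
]


def match(pattern, triple, binding):
    """Unify one triple pattern with one triple; return the extended binding or None."""
    binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier, if any.
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding


def select(variables, patterns, triples):
    """Evaluate a conjunction of triple patterns and project the given variables."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for current in bindings
            for triple in triples
            if (extended := match(pattern, triple, current)) is not None
        ]
    return [tuple(b[v] for v in variables) for b in bindings]


rows = select(
    ["?description"],
    [
        ("?protein", "oma:geneName", "HBB"),
        ("?protein", "biositemap:description", "?description"),
    ],
    TRIPLES,
)
print(rows)
```

The shared variable ?protein is what joins the two patterns: only a subject that satisfies both of them contributes a row to the result.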

In order to get a sense of how large bioinformatics databases 
currently are, but also to get a hands-on introduction into how they 
can be queried using SPARQL, we propose to retrieve the total 
number of proteins in UniProt in Exercise A at the end of this 
chapter. Furthermore, Exercise C will allow trying out and refining 
the OMA query introduced above, but also writing a new one, 
using the OMA SPARQL endpoint. 


4 Modeling Biological Databases with Semantic Web Technologies 


In this section we show a concrete example of how we can use 
Semantic Web technologies to model the two biology databases 
Bgee and OMA. 

Figure 7 illustrates a fragment of a candidate ontology describ- 
ing the relational database sample from Bgee (see Fig. 1). The 
ellipses illustrate classes of the ontology, either specific to the 
Bgee ontology, such as AnatomicEntity (the equivalent of the 
anatEntity table in the relational view), or classes from imported 
ontologies, such as the Taxon class (the prefix “up:” denoting the
UniProt ontology, http://purl.uniprot.org/core/). The advantage
of using external (i.e., imported) classes is that integration with
other databases which also instantiate these classes will be much
simpler. For example, we will see that the class Gene serves as the
“join point” between OMA and Bgee. Arrows define properties of
the ontology: either datatype properties (similar to attributes of a
table in the relational world), such as speciesName or
stageName, or object properties, which are similar to primary key-foreign
key relationships, given that they link instances of one class to those
of another. If we compare Fig. 7 (the ontology view) against Fig. 1
(the relational view), we notice that the object properties
isExpressedIn and isAbsentIn only appear explicitly in the ontology. This is
because the values of these properties will actually be calculated 
on-the-fly, from multiple attributes in the relational database. Given 
that Bgee is mainly used to query gene expressions, these properties 
are exposed as new semantic properties in the domain ontology, 
namely, expression or absence of expression of a gene in a particular 
anatomic entity. This is one of the means through which the 
semantic layer can not only describe but also enrich the data avail- 
able in the underlying layers (in this case, in the relational database). 
The domain of both the isExpressedIn and isAbsentIn properties is 
in this case a gene, while the range is an anatomic entity, such that 
triples that instantiate this relationship will have the structure: 
<Gene, isExpressedIn, AnatomicEntity>. 

Given that the OMA ontology is significantly larger than the
one for Bgee, we only show here the class hierarchy in Fig. 8. The
most important concepts in the ontology are shown in the top right
corner, namely, the cluster of orthologs and the cluster of paralogs,
which store information about gene orthology (or paralogy) in a
hierarchical tree structure (the gene-tree node). Similarly to the
Bgee ontology, the Gene class in OMA is external. Arrows indicate
the “rdfs:subClassOf” relationship; for example, both the “Cluster
of Orthologs” and the “Cluster of Paralogs” classes are
subclasses of the “Cluster of Homologs” class. For a description of
the ontology, as well as a discussion regarding its design within the
Quest for Orthologs Consortium, we point the reader to [13].
Furthermore, the ontology can be explored or visualized in
WebVOWL [52] using the Web page of the OMA SPARQL endpoint
[53], available online at https://sparql.omabrowser.org/sparql.

Fig. 8 The class hierarchy of the OMA ontology. Ellipses indicate class labels, while arrows indicate the “rdfs:subClassOf” property. Further details are available in [13]

So far we have explored a few relatively simple examples in
order to get familiar with the basics of Semantic Web technologies
(URIs, RDF triples, and SPARQL). We can now introduce
a more complex query that better illustrates the expressivity
of the SPARQL query language for accessing RDF stores, that
is, for integrating and joining data across different databases.

Since all RDF stores structure data using the same standard 
model for data interchange, the main requirements in order to 
efficiently join multiple sources are: 


1. That they each expose data through a SPARQL endpoint that 
supports federation (SPARQL 1.1) 


2. That the sources share URIs or ontologies 


This is the reason why already today we can jointly query, for 
example, OMA and UniProt—essentially, integrating the two data- 
bases by means of executing a federated SPARQL query. 

To illustrate this, let us consider the following example: what 
are the human genes available in the OMA database that have a 
known association with leukemia? OMA does not contain any
information related to diseases; UniProt, however, does. In this
case, since OMA already cross-references UniProt with the
oma:xrefUniprot property, we can write the following federated
SPARQL query, which will be run at the OMA SPARQL
endpoint:


select distinct ?proteinOMA ?proteinUniProt
where {
    service <http://sparql.uniprot.org/sparql> {
        ?proteinUniProt a up:Protein .
        ?proteinUniProt up:organism taxon:9606 . # Homo sapiens
        ?proteinUniProt up:annotation ?annotation . # annotations of this protein
        ?annotation rdfs:comment ?text .
        filter( regex(str(?text), "leukemia") ) # only those containing the text "leukemia"
    }
    ?proteinOMA a orth:Protein .
    ?proteinOMA oma:xrefUniprot ?proteinUniProt .
}


We skip the details regarding the prefixes used in the example 
and focus on the new elements in the query. The main part to point
out is the “service <http://sparql.uniprot.org/sparql>” block,
delimited by the inner braces. This enables using the SPARQL
endpoint of UniProt remotely, as a service. Through this mecha- 
nism, the query will first fetch from UniProt all instances of pro- 
teins that are annotated with a text that contains “leukemia” (this is 
achieved by the filter keyword in the service block). Then, using the 
cross-reference oma:xrefUniprot property, the query will return all 
the equivalent entries from OMA. From here, the user can explore, 
either in the OMA browser or by further refining the SPARQL 
query, other properties of these proteins: for example, their ortho- 
logs in a given species available in the database. In Exercise D at the 
end of this chapter, we encourage the reader to try this out in the 
OMA SPARQL endpoint. Note that the same results can be 
obtained by writing this query in the UniProt SPARQL endpoint 
and referencing the OMA one as a service. For an overview of 
federation techniques for RDF data, we refer the reader to the 
survey [54]. 
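
The two-step logic described above (filter remotely, then join locally on the cross-reference) can be mimicked in plain Python. The sketch below does not contact any endpoint: both data sets, all identifiers in them, and the function names remote_service and federated_query are invented for illustration.

```python
import re

# Stand-in for the remote endpoint: (protein, annotation text) pairs.
# All identifiers and texts are made up for this sketch.
UNIPROT_ANNOTATIONS = [
    ("uniprot:P1", "Involved in acute myeloid leukemia."),
    ("uniprot:P2", "Oxygen transport."),
    ("uniprot:P3", "Translocation found in chronic leukemia."),
]

# Stand-in for the local store: (local protein, cross-referenced remote entry).
OMA_XREFS = [
    ("oma:A", "uniprot:P1"),
    ("oma:B", "uniprot:P2"),
    ("oma:C", "uniprot:P3"),
    ("oma:D", "uniprot:P9"),
]


def remote_service(pattern):
    """Mimic the service block: keep proteins whose annotation matches the regex."""
    return {protein for protein, text in UNIPROT_ANNOTATIONS if re.search(pattern, text)}


def federated_query(pattern):
    """Join the remote result with the local cross-references."""
    remote = remote_service(pattern)
    return sorted({(oma, up) for oma, up in OMA_XREFS if up in remote})


result = federated_query("leukemia")
print(result)
```

The join only works because both sides use the same identifier for a UniProt entry, which mirrors the requirement that federated sources share URIs or ontologies.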

The mechanisms illustrated so far, while indeed powerful for 
federating distinct databases, have a major drawback: they require 
the user to know the schema of the databases (otherwise, how 
would we know which properties to query in the previous exam- 
ples?), and, more importantly, they require all users to be familiar 
with a technical query language, such as SPARQL. While very 
expressive, formulating such queries can quickly become over- 
whelming for non-programmer users. In the following, we will 
look at techniques that aim to overcome these limitations. 


5 Ontology-Based Integration of Heterogeneous Data Stores 


So far we have seen some of the alternatives available for storing
biological data: relational databases and triple stores. In this
section, we look at how these heterogeneous sources can be integrated
and accessed in a unified, user-friendly manner that does not
require knowledge of the location or structure of the underlying
data nor of the technical language (SQL or SPARQL) used to
retrieve the data. The architecture we present is inspired by work
presented in [21], which focused strictly on keyword search in
relational databases.


5.1 A System’s Perspective


We start with a bottom-up description of the layers that make up an 
integrated data access system, followed by a concrete example using 
the two bioinformatics databases introduced above: the orthology 
database OMA and the gene expression database Bgee. 

The four main layers of an integrated data access system, as
shown in Fig. 9, are:


1. Base Data Layer


2. Data Model Layer


3. Integration Layer


4. Presentation Layer


Fig. 9 Integrated data access system


5.1.1 Base Data Layer


This represents the physical storage layer, where all the actual data,
for example, experimental results, annotations, etc., are kept.
Figure 9 illustrates only a few of the possible storage types, namely,
relational databases, hierarchical data stores (e.g., HDF5), and
RDF stores. At this low-level layer, the data are usually structured
so as to optimize machine parameters, such as storage space,
complexity of joins required to answer physical queries, etc. Therefore,
it is not designed for human readability. Furthermore, tables,
column names, or even IDs may not match any real terms. For
example, the Bgee relational database uses the table name “anatEntity”
to refer to the term “anatomic entity,” while others may be even
further away from the original terms.


5.1.2 Data Model Layer


This layer is used to describe, at a higher level of abstraction, the
data contained in the physical storage. Here, for example, original
names for terms are recovered while also creating a mapping
between these higher-level terms (“Anatomical Entity”) and their
corresponding physical layer location (table “anatEntity” in schema
Bgee). The data model layer can be viewed as the first semantic layer
in the system, as it allows representing the actual terms referred to
in the underlying physical storage while abstracting away the details
of the actual structure of the physical storage. The data model layer
can be understood as an ontology that is, however, only applicable
at the level of an individual database.


5.1.3 Integration Layer

The integration layer performs a similar task to the data model 
layer, in that it defines a mapping between high-level concepts 
(“Anatomical Entity”) and all the occurrences where these concepts 
can be found in the physical storage (table “anatEntity” in schema 
Bgee, class “Anatomic Entity” in UniProt, etc.). In doing so, the 
integration layer also aligns the different data models, by defining
which identifiers from one data model correspond to which ones
from the others. In the case of biological databases, this is usually
done by taking into account cross-references, which already exist
between most databases, as we have seen in the SPARQL query in
Sect. 4.

While the data model layer can be seen as a local ontology, the
integration layer will serve as a global ontology. The integration
layer can be queried using, for example, SPARQL. However, in
order to get the results from the underlying sources, the SPARQL
query needs to be translated into the native query languages of the
underlying sources (e.g., SQL for relational databases). This is
achieved by using the mappings defined in the global ontology.
For example, the keyword “expressed in” does not have a direct
correspondence in Bgee, but it can be translated into an SQL
procedure (in technical terms, it represents an SQL view of the
data). Without going into details, at a high level, the property
“gene A expressed in anatomic entity B” will be computed by
looking at the number of experiments stored in the database that
show the expression of A in B. It is conceivable that in another
database, which could also form part of the integrated system, this
information is available explicitly. In this case the mapping would
simply be a 1-to-1 correspondence to the property value stored in
the database. The role of the integration layer is to capture all the
occurrences where a certain concept (entity or property) can be
found, along with a mapping for each of the occurrences, defining
how information about this concept can be computed from the
base data.

To summarize, the integration layer abstracts away the location
and structure of data in the underlying sources, providing users with
unified access through a global ontology. One of the drawbacks of
this approach is that, in the absence of a presentation layer, such as a
user-friendly query interface (e.g., a visual query builder or a
keyword-based search interface), the data represented in the global
ontology is accessible mainly through a technical query language,
such as SPARQL. Therefore, in order to be able to access the data,
users are required to become fluent in the respective query
language.

It is worth mentioning at this point that most data integration
systems available at the time of this writing only offer the three
layers presented so far. Examples of such systems, generically
denoted as ontology-based data access (OBDA) systems, are
Ontop [55], Ultrawrap [56], and D2RQ [57].


5.1.4 Presentation Layer


The three layers presented so far already achieve data integration, 
but with a significant drawback, which is that the user is required to 
know a technical query language, such as SPARQL. The role of the 
presentation layer is to expose data from all integrated resources in
an easy-to-access, user-friendly manner. The presentation layer
abstracts away the structure of the integration layer and exposes
data through a search interface that users (including
non-programmers) are familiar with, such as keyword search
[21, 58] or even full natural language search [59, 60].

The challenges in building the presentation layer are manifold.
First, human language is inherently ambiguous. As an example, let
us assume a user asks: “Is the HBB gene expressed in the blood?”
What does the user mean? The hemoglobin gene (HBB) in general?
Or just in the human? The system should be proactive in helping
the user clarify the semantics or intent of the question, before
trying to compute the underlying SPARQL query. Second, the
presentation layer should provide not only raw results but also an
explanation, for example, what sources were queried, how many
items from each source have been processed in order to generate
the response, etc. This enables the user to validate the generated
results or to otherwise continue refining the question. Third, the
presentation layer must also rank the results according to some
relevance metric, similarly to how search results are scored in Web
search engines. Given that the number of results retrieved from the
underlying sources can easily become overwhelming (e.g., searching
for “HBB” in Bgee returns over 200 results), it is important
that the most relevant ones are shown first.

From a technical point of view, the presentation layer maintains
an index (i.e., the vocabulary) of all keywords stored in the lower
layers, both data and metadata (descriptions, labels, etc.), such that
each keyword in a user query can be mapped to existing data in the
lower layers. An important observation is that the presentation
layer relies heavily on the quality of the annotations available in the
lower layers. In the absence of human-readable labels and descriptions
in the global ontology, the vocabulary collected by the presentation
layer will miss useful terms that the user might search for. One way
to detect and fix this problem is to always log user queries and
improve the quality of the annotations “on demand,” whenever the
queries cannot be solved due to missing items in the vocabulary.
For a more extended discussion on the topic of labels and their role
in the Semantic Web, refer to [61].

Finally, it is worth noting that none of these layers need to be
centralized. Indeed, even in the case of the integration layer,
although its role is to build a common view of all data in the physical
storage, it can be distributed across multiple machines, just as long
as the presentation layer knows which machine holds which part of
the unified view.


5.2 A Concrete Example: A Global Ontology to Unify OMA and Bgee


So far we have seen an abstract view of a system for data integration 
across heterogeneous databases. It is time to look at how this 
translates into a real-world example, using the Bgee relational 
database and the OMA RDF database.

Fig. 10 A sample global ontology for integrating OMA and Bgee and an example assertion


The top part of Fig. 10, the terminological box, illustrates part 
of the global ontology (layer 3, integration layer) for the two 
databases, with most of the terms being part of OMA, except for 
Anatomic Entity, which is specific to Bgee. As mentioned previ- 
ously, OMA extends the ORTH ontology, which is why the 
corresponding terms in the ontology are prefixed with “orth:.” 
The Gene concept can actually be found in both Bgee and OMA; 
therefore the global ontology will define mappings to both sources. 
As we can see in the ontology, the Gene is the common point that 
joins together OMA and Bgee. The gene IDs used in both databases
are Ensembl IDs [62], stored in the ensemblGeneId string
property. For example, the human hemoglobin gene, “HBB,”
which we previously showed as an example entry in OMA,
corresponds to the ENSG00000244734 Ensembl ID and can also be
found in Bgee.


The lower part of Fig. 10, the assertional box, illustrates an
example assertion: in this case, that the protein HUMAN22168
in OMA is orthologous to the protein HORSE13872 and that,
furthermore, this protein is encoded by the gene with the Ensembl
ID ENSG000001639936. Moreover, this gene is expressed in the
brain (the Uberon ID for this being “UBERON:0000955”). The
human-readable description is stored in the String literal label, as,
for example, the name of the anatomic entity, “brain,” shown in the
bottom-right corner in the figure. Without labels, much of the
available data would not be easily searchable by a human user nor
by an information retrieval system.

Note that with this sample ontology, we can already answer
questions related to orthology and gene expression jointly, such as
the first part of our introductory query: “What are the human-rat
orthologs, expressed in the liver...?”. This question essentially
refers to pairs of orthologous Genes (those in human and rat) and
their expression in a given Anatomic Entity (the liver). Apart from
the Species class, which is not explicitly shown, all of the information
is already captured by the ontology in Fig. 10. A similar mechanism
can be used to further extend this to UniProt (for instance, based
again on gene IDs as the “join point,” or by using existing
cross-references, as we have shown in the previous section), therefore
enabling users to ask even more complex queries.


5.3 How to Link a Database with an Ontology?


One of the main challenges in implementing technologies for the 
Semantic Web was recognized from early on (see the study pub- 
lished in 2001 by Calvanese et al. [63]) to be the problem of 
integrating heterogeneous sources. In particular, one of the observa- 
tions made was that integrating legacy data will not be feasible 
through a simple 1-to-1 mapping of the underlying sources into 
an integrative ontology (e.g., mapping all attributes of tables in 
relational databases to properties of classes in an ontology), but 
rather through more complex transformations, that map views of 
the data into elements of the global ontology [63]. 
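
A view-based mapping of this kind can be sketched with Python's built-in sqlite3 module. The sketch is purely illustrative: the table experiment, its columns, the rows, and the threshold MIN_EXPERIMENTS are hypothetical and do not reflect the actual Bgee schema or any agreed definition of expression. The SQL query plays the role of the view, and its rows are turned into triples only when requested.

```python
import sqlite3

# Hypothetical threshold: how many supporting experiments count as "expressed".
MIN_EXPERIMENTS = 2

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical relational table: one row per experiment.
    CREATE TABLE experiment (geneId TEXT, anatEntityId TEXT, detected INTEGER);
    INSERT INTO experiment VALUES
        ('geneA', 'tissueX', 1),
        ('geneA', 'tissueX', 1),
        ('geneA', 'tissueY', 1),
        ('geneB', 'tissueX', 0),
        ('geneB', 'tissueX', 0);
    """
)


def is_expressed_in_triples(connection, min_experiments):
    """Compute <Gene, isExpressedIn, AnatomicEntity> triples on demand."""
    rows = connection.execute(
        """
        SELECT geneId, anatEntityId
        FROM experiment
        WHERE detected = 1
        GROUP BY geneId, anatEntityId
        HAVING COUNT(*) >= ?
        ORDER BY geneId, anatEntityId
        """,
        (min_experiments,),
    )
    return [(gene, "isExpressedIn", anat) for gene, anat in rows]


triples = is_expressed_in_triples(conn, MIN_EXPERIMENTS)
print(triples)
```

No column of the table stores the isExpressedIn property; it exists only as the result of an aggregate query, which is why a 1-to-1 mapping of attributes to ontology properties would not be able to express it.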

To illustrate this with a concrete example, let us consider again 
the unified ontology for OMA and Bgee that we introduced in the 
previous section. Although Figure 10 shows properties such as 
“gene isExpressedIn” or “gene hasOrtholog,” this data is actually 
not explicitly stored in the underlying databases but rather needs to 
be computed on-the-fly based on the available data. For example, 
the “isExpressedIn” property can be computed based on the num- 
ber of experiments which show the expression of a gene in a certain 
anatomic entity in Bgee. Deciding the exact threshold for when a 
gene is considered as “expressed” according to the data available is 
not straightforward and needs to be agreed upon by domain spe- 
cialists. Therefore, the integration layer will also serve to enrich the 
data available in the underlying layers, by defining new concepts
based on this data (e.g., the presence or absence of gene expression
in an anatomic entity).

At this point it is worth clarifying an important question: why
are mappings necessary? Why is it not enough to replicate the data
in the different underlying formats into a single, uniform format (e.g.,
translate all RDB data into RDF)? The answer is that not only
would such a translation require a lot of engineering effort, but
more importantly, it would transform the data from a format that is
highly optimized for data access, into a format that is optimized for
different purposes (data integration and reasoning). Querying
relational databases still is, today, the most efficient means of accessing
very large quantities of structured data. Transforming all of it into
RDF would in many cases mean downgrading the overall performance
of the system. Indeed, in some cases storing RDF data in a
relational format has proven to be more efficient [64].

So how are mappings then created? One of the main mechanisms
to achieve this is currently R2RML, available
as a W3C recommendation online [65]. R2RML enables
mapping relational data to the RDF model, as chosen by the
programmer. For a concrete example of how mappings can be
defined and what the advantages of this approach are, we refer the
reader to [66]. A mapping essentially defines a view of the data,
which is a query (in this case, an SQL query) that allows retrieving a
relevant portion of the underlying data, in order to answer a
higher-level question (e.g., what is “expressed in”?). The materialization of
this query (the answer) will be returned in RDF format, on
demand, according to the mapping. This avoids duplicating or
translating data in advance from the underlying relational database
into RDF until it is really needed, in order to answer a user query.

For a discussion regarding the limitations of R2RML and
alternative approaches to define mappings from relational data to
RDF, we refer the reader to the survey [67].


5.4 Putting Things Together


So far we have seen how individual sources can be represented into 
a single, unified ontology, and we had a high-level view of a data 
access system that enables users to ask queries and get responses in a 
unified way, without knowledge of where data is located or how it is 
structured. In this section we finally look at how all of these com- 
ponents can work together in answering natural language queries 
on biological databases. Although there are multiple alternatives to 
natural language interfaces, including visual query interfaces or 
keyword-based search interfaces, it has been shown that natural 
language interfaces are the most appropriate means to query 
Semantic Web data for non-technical end-users [68]. As a
consequence, natural language querying based on Semantic Web
technologies is currently an active area of research. Recent systems
that implement an ontology-based natural language interface
include Athena [59] and TRDiscover [60].
First, recall the user question we formulated in the beginning of
this chapter: “What are the human-rat orthologs, expressed in the
liver, that are associated with leukemia?” Let us assume the
resources at hand to answer this question are the biological
databases OMA, Bgee, and UniProt. The four main steps required to
translate the natural language question into the underlying query
languages of OMA, Bgee, and UniProt will be:


(a) Identify entities in the query

This is the natural language processing step that extracts
the main concepts the user is interested in, based on the
keywords of the input query: orthologs, human, rat, expressed,
liver, associated, and leukemia.


(b) Identify matches of the entities in the integrative ontology

The extracted keywords will be searched for in the vocabu- 
lary of the presentation layer, resulting in one or multiple URIs, 
given that a keyword can match multiple concepts. For exam- 
ple, the keyword “orthologs” can match either the entity 
“OrthologCluster” or the property “hasOrtholog” of a gene 
in OMA. The index of the presentation layer will also return the 
location the URI originates from (OMA or Bgee or UniProt). 


(c) Construct subqueries for each of the matches

The extracted URIs will be used to construct subqueries 
on each of the underlying data sources. This step requires 
translating the original query into the native language of 
each underlying database, with specific mechanisms for each 
type of database (relational or triple store). At a high level, the 
translation process involves finding the minimal sub-schema 
(or subgraph in the case of RDF data) that covers all the 
keywords matched from the input query. Taking the example 
previously shown in Fig. 10, the minimal subgraph that con- 
tains “orthologs” and “expressed” will essentially contain only 
two nodes of the entire graph: Gene (which is both the domain 
and the range of the “hasOrtholog” property in the Orthol- 
ogy Ontology) and AnatomicEntity (which is the range of the 
“isExpressedIn” property in the Bgee ontology). All the 
unknowns of the query (e.g., which ortholog genes) are 
replaced by variables. The final subqueries for OMA and 
Bgee might therefore (informally) look like this: 


OMA:

select ?gene1 ?gene2 where {
  ?protein1 a Protein.
  ?protein1 inTaxon "Homo sapiens".
  ?protein1 isEncodedBy ?gene1.
  ?protein1 hasOrtholog ?protein2.
  ?protein2 inTaxon "Rattus norvegicus".
  ?protein2 isEncodedBy ?gene2.
}


Note that we have simplified the actual query for readability
purposes (using the literals "Homo sapiens" and "Rattus
norvegicus" instead of their corresponding URIs). This subquery
will cover the keywords: ortholog, human, and rat.
Notice that the query should return genes, not proteins,
because the join point between OMA and Bgee is the Gene
class.


Bgee:

select ?gene where {
  ?gene a Gene.
  ?gene isExpressedIn ?anatomicEntity.
  ?anatomicEntity rdfs:label "liver".
}


This subquery will therefore cover the expressed and liver 
keywords. The final step will be then to get the similar sub- 
query for UniProt (which we omit here for brevity) and to 
compute the joint result, namely, the intersection between all 
the sets returned by the subqueries. 


(d) Join the results from each of the subqueries

This final step is essential in keeping the performance of 
the system to an acceptable level. Joining (federating) the 
results of several subqueries into a unified result is not an 
easy task and requires a careful ordering of the operations 
from all subqueries. To understand this problem, let us con- 
sider again our example and try to see how many results each 
of the subqueries will return. First, if we take a look at the 
OMA browser and try to find all orthologs between human 
and rat, this will amount to more than 21,000 results. How- 
ever, is the user really interested in all of them? Certainly not, 
as the input query shows—the user is only interested in a small 
fraction of the orthologs, namely, those that are expressed in 
the liver and have an association with leukemia (according to 
the data stored in Bgee and UniProt). How many are these? If 
we now refer to UniProt and look for the disease leukemia, we 
will find that there are only 20 entries which illustrate the 
association with this disease. Clearly, getting only the ortho- 
logs of these 20 entries will be much more efficient than 
retrieving all 21,000 pairs from OMA first and then removing 
most of them to only keep relevant ones. 

However, note that in this case, we only know this infor- 
mation because we constructed the queries and tried them out 
by hand first. How should the system estimate the number of 
results (i.e., the cardinality of each subquery) in advance? This 
question has been an active area of research for a long time. 
Some of the methods used to tackle this problem are either to 
precompute statistics regarding the number of results available 
in different tables of the underlying sources [69] or to use
statistics regarding previously asked queries to optimize the 
new ones, for example, via statistical machine learning [70]. In 
the first case, we would, for instance, store the individual 
counts of different orthologous pairs while also keeping statis- 
tics about diseases if we expect these types of questions to be 
asked frequently, whereas in the second case, we would simply 
look at the number of results similar subqueries generated in 
the past, to optimize which results to fetch first. For a recent 
study of optimization methods for federated SPARQL 
queries, see [71]. 
(e) Present the user the final results 

Finally, the joined results are returned to the user, along 
with an explanation regarding the constructed query and the 
entities that were matched in order to construct it. In this way, 
the user has the opportunity to validate the correctness of the 
answer or otherwise to further refine the question. 

For a more in-depth discussion regarding natural lan- 
guage query interfaces in ontology-based data access systems, 
we refer the reader to Athena [59] and TRDiscover [60]. 
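
The join-ordering idea from step (d) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how any particular federated query engine works. The function names, the toy gene identifiers, and the Bgee row estimate are hypothetical; the figures 21,000 and 20 are the counts quoted above for OMA and UniProt.

```python
from typing import Callable, Dict, List, Optional, Set

# Hypothetical cardinality estimates per subquery. The OMA and UniProt figures
# are the ones quoted in the text; the Bgee figure is an invented placeholder.
ESTIMATED_ROWS: Dict[str, int] = {"OMA": 21_000, "Bgee": 5_000, "UniProt": 20}

Fetcher = Callable[[Optional[Set[str]]], Set[str]]


def order_subqueries(estimates: Dict[str, int]) -> List[str]:
    """Return source names sorted from smallest to largest expected result."""
    return sorted(estimates, key=estimates.get)


def federated_join(fetch: Dict[str, Fetcher], estimates: Dict[str, int]) -> Set[str]:
    """Intersect gene sets, starting with the most selective source.

    Each fetch function receives the current candidate set (or None for the
    first source) so that it can restrict its own lookup to those genes.
    """
    candidates: Optional[Set[str]] = None
    for source in order_subqueries(estimates):
        result = fetch[source](candidates)
        candidates = result if candidates is None else candidates & result
        if not candidates:  # nothing left to join, stop early
            break
    return candidates or set()


# Toy stand-ins for the three endpoints (gene identifiers are invented).
TOY_DATA: Dict[str, Set[str]] = {
    "OMA": {"g1", "g2", "g3", "g4", "g5"},
    "Bgee": {"g2", "g3", "g5"},
    "UniProt": {"g3", "g5", "g9"},
}


def make_fetch(source: str) -> Fetcher:
    def fetch(candidates: Optional[Set[str]]) -> Set[str]:
        data = TOY_DATA[source]
        return data if candidates is None else data & candidates
    return fetch


FETCHERS: Dict[str, Fetcher] = {name: make_fetch(name) for name in TOY_DATA}

if __name__ == "__main__":
    print(order_subqueries(ESTIMATED_ROWS))
    print(sorted(federated_join(FETCHERS, ESTIMATED_ROWS)))
```

Starting from the source with the smallest estimate means that the large OMA lookup only has to consider genes that already passed the more selective filters, which is the effect described in step (d).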


6 Timeline of Semantic Web Technologies and Ontology-Based Data Integration in 
Life Sciences 


The field of life sciences has been an early adopter of Semantic Web 
technologies, due to the need of interoperability and integration of 
biological data spread across different databases. In this section, we 
provide a brief timeline (see Fig. 11), including the example ontol- 
ogies introduced in this chapter. 


— 1995: Davidson et al. [72] suggest basic steps to integrate 
bioinformatics data (common data model, match semantically 
related objects, schema integration, transform data into feder- 
ated database, match semantically equivalent data). 


— 2000: TAMBIS (Transparent Access to Multiple Bioinfor- 
matics Information Sources) [73] proposes a unified ontology 
covering many aspects of the bioinformatics knowledge space. 


— 2000: The “Gene Ontology—a tool for the unification of
biology” [37] is the first significant milestone in unifying
diverse biological databases, focusing on gene functions. Even
before the publication of the Semantic Web paper by Tim
Berners-Lee (in the following year), the GO highlighted the benefits
of controlled vocabularies and standardized naming, both
precursors of Semantic Web technologies, which were adopted in
the GO in the year 2002 [74]. Today it is, arguably, the most
comprehensive resource of computable knowledge regarding
gene functions and products.


Fig. 11 A selective timeline of data integration efforts in life sciences


— 2001: Launch of the BioMoby project [75] providing a unified
registry of Web services for life scientists using a consensus- 
driven approach. It listed, for instance, all services converting 
gene names to GO terms or all databases accepting GO terms. 
The registry is currently no longer maintained. 


— 2003: A Nature Reviews Genetics article on Integrating
Biological Databases [76] highlights the “database-surfing” 
problem (i.e., the time-consuming process of manually visiting 
multiple databases to answer complex biological research ques- 
tions) and argues for standardized naming of biological objects to 
overcome the problem. Link integration, view integration, and 
data warehousing are proposed for data integration. Arguably, 
link integration has since become the most adopted solution. 


— 2003: Launch of UniProt [77] by the UniProt Consortium, a
collaboration between the Swiss Institute of Bioinformatics
(SIB), the European Bioinformatics Institute (EBI), and the
Protein Information Resource (PIR). UniProt is the world’s
most comprehensive freely accessible resource on protein
sequences and functional annotation. Since 2008 the data has been
published in RDF, and since 2013 a SPARQL endpoint has been
provided [78].

— 2004: The first International Workshop on Data Integration 
in the Life Sciences, held in Leipzig, promotes “a Bioinformat- 
ics Semantic Web” and highlights solutions for heterogeneous 
data integration. The workshop continues to be held every year, 
and its proceedings (e.g., [79]) provide a good overview of 
advances in the field. 


— 2005: The W3C Consortium launches the Semantic Web 
Health Care and Life Sciences Interest Group (HCLS IG) 
to develop the use of Semantic Web technologies to improve 
health care and life sciences research. Today, the HCLS Linked 
Data Guide [80] provides best practices for publication of 
biological Linked Data on the Web. 


— 2006: The OBO Foundry [40] establishes principles for ontol- 
ogy development and evolution to support biomedical data 
integration through a suite of orthogonal interoperable refer- 
ence ontologies. 


— 2006: Publication of the Ontology Lookup Service (OLS), a 
repository for biomedical ontologies with the aim to provide a 
single point of access (with controlled vocabulary queries) to the 
latest ontology versions. It allows interactive browsing, as well as 
programmatic access [81].


— 2007: Launch of the National Center for Biomedical Ontol- 
ogy (NCBO) BioPortal [82], a web portal to biomedical 
ontologies. OBO ontologies are a central component. The por- 
tal started with 50 ontologies; to date it is the most comprehen- 
sive repository with currently 852 biomedical ontologies and 
more than eight million classes. 


— 2008: Launch of the BioMoby Consortium [83] and the first 
release of the BioMoby Semantic Web Service, at the time 
providing interoperable access to over 1400 bioinformatics 
resources worldwide. 


— 2008: BioGateway [84] provides a single SPARQL entry point 
to all OBO candidate ontologies, the GO annotation files, the 
SWISS-PROT protein set, the NCBI taxonomy, and several 
in-house ontologies. 


— 2008: The Briefings in Bioinformatics journal launches a 
special issue dedicated to Database Integration in Life 
Sciences [85], acknowledging the major challenge of integrat- 
ing data scattered over millions of publications and thousands of 
heterogeneous databases. 


— 2008: Bio2RDF [86] applies Semantic Web technology to
various publicly available databases (converting them into RDF
format and linking with normalized URIs and a common
ontology). Updates continue to be provided for increased
interoperability among bioinformatics databases [87, 88].


— 2009: Briefings in Bioinformatics publishes a review on 
Biological Knowledge Management [89], highlighting the 
transforming role of ontologies and Semantic Web technologies 
in enabling knowledge representation and extraction from het- 
erogeneous bioinformatics databases. 


— 2010: NCBO launches a SPARQL endpoint, available at
http://sparql.bioontology.org/.


— 2012: Publication of a survey highlighting the benefits of
integration using Semantic Web technologies in the field of
Integrative Biology [90].


— 2016: Publication of the Orthology Ontology [13]. 


7 Conclusions and Outlook


Data integration is arguably one of the most important enablers of
new scientific discoveries, given that research data is currently
growing at an unprecedented rate. This is especially true in the
case of biological databases. While data integration poses many
challenges, the emergence of standards, integrative ontologies, as
well as the availability of cross-references between many of the
biological databases make the problem easier to tackle. This chapter
has provided a brief introduction to the methods that can be used
to integrate heterogeneous databases using Semantic Web
technologies while also providing a concrete example of achieving this goal
for three well-known existing biological databases: OMA, Bgee,
and UniProt.

Although there would be many more aspects to cover and
much of the work for achieving wide-scale data integration still
remains to be done, we would like to end this chapter by
reinforcing the following conclusion, extracted from a study of Biological
Ontologies for Biodiversity Knowledge Discovery [91]:


We hope that current work will spur interest and feedback from scientists and
bioinformaticians who see data integration, interoperability, and reuse as the
solution to bringing the past 300 years of biological exploration of the planet
into currency for science and society.


8 Exercises


A. Querying UniProt with SPARQL


The goal of this warm-up exercise is to get familiar with a SPARQL
endpoint and to write your first SPARQL query. For this purpose,
open the link to the UniProt SPARQL endpoint,
http://sparql.uniprot.org/, in a Web browser. How many entries do you
think are available in UniProt? To find out, simply check the bottom-left
corner of the Web page, where the total number of
triples is always kept up to date. How many of these entries
describe proteins? To find out, try running the following SPARQL
query that counts all instances of the database that belong to the
protein class. What is the result?


PREFIX up:<http://purl.uniprot.org/core/>
SELECT (count(?protein) as ?count)
WHERE
{ ?protein a up:Protein. }


Notice that the UniProt SPARQL web page includes many 
examples on the right-hand side—in order to get more familiar 
with UniProt and SPARQL, try further some of the sample queries 
provided there. 
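
The same count query can also be sent from a script rather than from the Web page. The sketch below is a hedged illustration using only the Python standard library. It assumes that the endpoint at https://sparql.uniprot.org/sparql accepts a GET request with a `query` parameter, as described by the SPARQL 1.1 Protocol, and returns results in the standard SPARQL JSON format. The helper names `build_request` and `run_count` are our own and are not part of any UniProt tooling.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint address (not guaranteed to stay stable over time).
ENDPOINT = "https://sparql.uniprot.org/sparql"

COUNT_PROTEINS = """
PREFIX up: <http://purl.uniprot.org/core/>
SELECT (COUNT(?protein) AS ?count)
WHERE { ?protein a up:Protein . }
"""


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a GET request that passes the query in the `query` parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_count(endpoint: str = ENDPOINT) -> int:
    """Send the query and read the single count binding (needs network access)."""
    with urllib.request.urlopen(build_request(endpoint, COUNT_PROTEINS)) as resp:
        results = json.load(resp)
    return int(results["results"]["bindings"][0]["count"]["value"])


if __name__ == "__main__":
    # Only builds and prints the request URL; call run_count() to query the endpoint.
    print(build_request(ENDPOINT, COUNT_PROTEINS).full_url)
```

Counting all proteins can take a while on a public endpoint, so `run_count` is deliberately not called by default.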


B. Exploring Biological Ontologies Through Keyword Search in 
the Ontology Lookup Service 


We have seen in Sect. 3.6 an example assertion about the 
“HBB” gene in the human, including the following triple: 


oma:PROTEIN_HUMAN04027 obo:RO_0002162 <http://www.uniprot.org/taxonomy/9606> .


This triple essentially asserts that the gene is located in the 
Homo sapiens taxon. However, as a regular user, how could you 
know what the URIs for “in taxon” and Homo sapiens are? One of 
the possible ways to get these identifiers is by searching for the 
keywords of interest in the Ontology Lookup Service (OLS). To do 
this, go to the Web page of the service, https://www.ebi.ac.uk/ols/index,
and try entering “in taxon” first. What is the result? Try also
Homo sapiens. What about “human”? 


C. Querying OMA with SPARQL 


Recall from Sect. 3.6 the sample query we presented for retriev- 
ing the description of the human hemoglobin gene from OMA. We 
provide it in a more explicit form here: 


SELECT ?description WHERE {
  ?protein oma:geneName "HBB".
  ?protein <http://bioontology.org/ontologies/biositemap.owl#description> ?description.
}


First try to think about possible information that is missing 
from this query. For example, is this query guaranteed to return a 
single result (remember we are using an orthology database)? 
Try to look again at how the human “HBB” protein is defined
in Sect. 3. Then, try to run the SPARQL query as-is in the OMA 
SPARQL endpoint: https://sparql.omabrowser.org/sparql. What 
do you get? What is the reason? Try to print out more information 
about the protein, not just its description. For example, add 
another triple pattern to capture the oma:hasOMAId property 
value as well (don’t forget to add it to the selected variables in the 
first line!), perhaps also the taxon ID in UniProt. What can you 
deduce? Can you correct the query so that it only gets the descrip- 
tion we were originally interested in? 


D. Federated Queries Using SPARQL (OMA and UniProt) 


In Sect. 4 we presented an example Federated Query using the 
SPARQL endpoint of OMA and the remote SPARQL endpoint of 
UniProt, as a service. We recall the query here: 


prefix up:<http://purl.uniprot.org/core/>
prefix taxon:<http://purl.uniprot.org/taxonomy/>
select distinct ?proteinOMA ?proteinUniProt
where {
  service <http://sparql.uniprot.org/sparql> {
    ?proteinUniProt a up:Protein .
    ?proteinUniProt up:organism taxon:9606 . # Homo sapiens
    ?proteinUniProt up:annotation ?annotation . # annotations of this protein entry
    ?annotation rdfs:comment ?text .
    filter( regex(str(?text), "leukemia") ) # only those containing the text "leukemia"
  }
  ?proteinOMA a orth:Protein.
  ?proteinOMA oma:xrefUniprot ?proteinUniProt.
}


Try running this query in the OMA SPARQL endpoint, 
https://sparql.omabrowser.org/sparql. You might need to wait a
couple of minutes to get the remote results. Next, try to look at the 
examples provided in the right side of the page to see how to get 
more properties of the proteinOMA variable—for example, try 
getting the description or the OMA ID. Next, try modifying this 
query so that it can run in the UniProt SPARQL endpoint, invok- 
ing the OMA one as a service. Remember to get the relevant 
prefixes and define them in the header of the query first (“oma,” 
“orth”). You can get these by looking at “Namespace prefixes” in 
the OMA SPARQL Web page. Finally, test your modifications 
using UniProt, http://sparql.uniprot.org.


High-Performance Computing in Bayesian Phylogenetics
and Phylodynamics Using BEAGLE


Abstract


In this chapter, we focus on the computational challenges associated with statistical phylogenomics and how 
use of the broad-platform evolutionary analysis general likelihood evaluator (BEAGLE), a high- 
performance library for likelihood computation, can help to substantially reduce computation time in 
phylogenomic and phylodynamic analyses. We discuss computational improvements brought about by the 
BEAGLE library on a variety of state-of-the-art multicore hardware, and for a range of commonly used 
evolutionary models. For data sets of varying dimensions, we specifically focus on comparing performance 
in the Bayesian evolutionary analysis by sampling trees (BEAST) software between multicore central 
processing units (CPUs) and a wide range of graphics processing units (GPUs). We put special emphasis
on computational benchmarks from the field of phylodynamics, which combines the challenges of phylo- 
genomics with those of modelling trait data associated with the observed sequence data. In conclusion, we 
show that for increasingly large molecular sequence data sets, GPUs can offer tremendous computational 
advancements through the use of the BEAGLE library, which is available for software packages for both 
Bayesian inference and maximum-likelihood frameworks. 


Key words Adaptive Markov chain Monte Carlo, Multipartite data, Generalized linear model, High- 
performance computing, BEAGLE, BEAST, Pathogen phylodynamics, Data integration, Bayesian 
phylogenetics, Phylogenomics 


1 Introduction 


Phylogenomics, a term coined by Eisen and Fraser [13], explores 
the intersection of evolutionary studies and genomic analyses. 
Accurate phylogenetic reconstruction using genomic data has 
important repercussions for answering particular questions in 
genome analysis, as phylogenomic analyses often involve estimating 
the underlying evolutionary history of sequences either as an inter- 
mediate goal or as an end point. The availability of more and more 
complete genomes can help to correct for phylogenetic reconstruc- 
tion artifacts and contradictory results that often appeared in 
molecular phylogenies based on a single or few orthologous genes
[21]. Expanding the number of characters that can be used in 
phylogenetic reconstruction from a few thousand to tens of 
thousands, these large quantities of data lead to reduced estimation 
errors associated with site sampling, to very high power in the 
rejection of simple evolutionary hypotheses and to high confidence 
in estimated phylogenetic patterns [4]. 

Among the phylogenetic reconstruction approaches that have 
attained widespread recognition, Bayesian inference has become 
increasingly popular, in large part due to the availability of open- 
source software packages such as the Bayesian evolutionary analysis 
by sampling trees (BEAST) software [11] and MrBayes [29]. Bayes- 
ian phylogenetic inference is based on a quantity called the poste- 
rior distribution of trees, which involves a summation over all trees 
and, for each tree, integration over all possible combinations of 
branch length and substitution model parameter values [20]. Ana- 
lytical evaluation of this distribution is practically infeasible, and 
hence needs to be approximated using a numerical method, the 
most common being Markov chain Monte Carlo (MCMC). The 
basic idea is to construct a Markov chain that has as its state space 
the parameters of the statistical model and a stationary distribution 
that is the posterior distribution of the parameters (including the 
tree) [20]. While MCMC integration has revolutionized the field of 
phylogenetics [34], the continuously increasing size of data sets is 
pushing the field of statistical phylogenetics to its limits. 

While promising approaches to improve MCMC efficiency 
have emerged recently from the field of computational statistics, 
such as sequential Monte Carlo (SMC; see, e.g., Doucet [10]) and 
Hamiltonian Monte Carlo (HMC; see, e.g., Neal [27]), these 
approaches do not yet find widespread use in phylogenetics. The 
primary difficulty in this adoption centers around the tree that 
encompasses both continuous and discrete random variables. 
Instead, considerable attention is being devoted to techniques for
parallelization [32] to improve phylogenetic software run-times.
Obtaining sufficient samples from a Markov chain may take many 
iterations, due to the large number of trees that may describe the 
relationships of a group of species and high autocorrelation 
between the samples. It is therefore of critical importance to per- 
form each iteration in a computationally efficient manner, making 
optimal use of the available hardware. High-performance compu- 
tational libraries, such as the broad-platform evolutionary analysis 
general likelihood evaluator (BEAGLE) [3], can be useful tools to 
enable efficient use of multicore computer hardware (or even 
special-purpose hardware), while at the same time requiring mini- 
mal knowledge from the software user(s). 

In this chapter, we first introduce the BEAGLE software library 
and its primary purpose, characteristics, and typical usages in Sub- 
heading 2, along with the hardware specifications of the devices 
used for benchmarks in this chapter. In Subheading 3, we present
computational benchmarks on the different hardware devices for a
collection of data sets that are typically analyzed with models of
varying complexity. Subheading 4 presents a brief overview of
studies for which GPU computing capabilities were critical to
analyze the data in a timely fashion. Given the increasing
capabilities of hardware devices, we present an interesting avenue for
further research in Subheading 5, in the form of adaptive MCMC.


2 The BEAGLE Library 


2.1 Principles 


BEAGLE [3] is a high-performance likelihood-calculation platform
for phylogenetic applications. BEAGLE defines a uniform
application programming interface (API) and includes a collection of
efficient implementations for evaluating likelihoods under a wide
range of evolutionary models, on graphics processing units (GPUs)
as well as on multicore central processing units (CPUs). The
BEAGLE library can be installed as a shared resource, to be used by any
software aimed at phylogenetic reconstruction that supports the
library. This approach allows developers of phylogenetic software to
share any optimizations of the core calculations, and any program
that uses BEAGLE will automatically benefit from the
improvements to the library. For researchers, this centralization provides a
single installation to take advantage of new hardware and
parallelization techniques.

The BEAGLE project has been very successful in bringing
hardware acceleration to phylogenetics. The library has been
integrated into popular phylogenetics software including BEAST
[11], MrBayes [29], PhyML [19], and GARLI [35] and has been
widely used across a diverse range of evolutionary studies. The
BEAGLE library is free, open-source software licensed under the
Lesser GPL and available at https://beagle-dev.github.io.


2.1.1 Computing Observed Data Likelihoods


The most effective methods for phylogenetic inference involve
computing the probability of observed character data for a set of 
taxa given an evolutionary model and phylogenetic tree, which is 
often referred to as the (observed data) likelihood of that tree. 
Felsenstein demonstrated an algorithm to calculate this probability 
[16], and his algorithm recursively computes partial likelihoods via 
simple sums and products. These partial likelihoods track the prob- 
ability of the observed data descended from an internal node con- 
ditional on a particular state at that internal node. 
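
As a rough illustration of this recursion, the following Python sketch computes the likelihood of a single site on a small rooted tree. It is not BEAGLE code and makes several simplifying assumptions that are ours: the Jukes-Cantor (JC69) substitution model, a hypothetical three-taxon tree with invented branch lengths, and a nested-tuple tree representation.

```python
import math
from itertools import product

STATES = "ACGT"


def jc69(t: float) -> list:
    """4x4 Jukes-Cantor transition probabilities for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def partials(node) -> list:
    """Partial likelihoods L[s] = P(tip data below node | state s at node).

    A tip is a one-letter string; an internal node is a tuple
    (left, left_branch_length, right, right_branch_length).
    """
    if isinstance(node, str):
        return [1.0 if s == node else 0.0 for s in STATES]
    left, t_left, right, t_right = node
    l_part, r_part = partials(left), partials(right)
    p_left, p_right = jc69(t_left), jc69(t_right)
    out = []
    for s in range(4):
        # Sum over child states on each branch, then multiply the two branches.
        from_left = sum(p_left[s][x] * l_part[x] for x in range(4))
        from_right = sum(p_right[s][y] * r_part[y] for y in range(4))
        out.append(from_left * from_right)
    return out


def site_likelihood(root) -> float:
    """Combine root partials with the uniform JC69 stationary frequencies."""
    return sum(0.25 * value for value in partials(root))


def three_taxon_tree(a: str, b: str, c: str):
    """Hypothetical rooted tree ((a:0.1, b:0.1):0.05, c:0.15)."""
    return ((a, 0.1, b, 0.1), 0.05, c, 0.15)


if __name__ == "__main__":
    print(site_likelihood(three_taxon_tree("A", "A", "A")))
    print(site_likelihood(three_taxon_tree("A", "C", "G")))
    # Sanity check: the likelihoods of all 64 possible tip patterns sum to one.
    print(sum(site_likelihood(three_taxon_tree(*tips))
              for tips in product(STATES, repeat=3)))
```

The inner sums over child states are the "simple sums and products" mentioned above; each of the four entries returned by `partials` is one element of a partial likelihood array.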

The partial likelihood calculations apply to a subtree comprising
a parent node, two child nodes, and connecting branches. The
calculation is repeated for each unique site pattern in the data (in the form of a
multiple sequence alignment), for each possible character of the
state space (e.g., nucleotide, amino acid, or codon), and for each
internal node in the proposed tree. The computational complexity
of the likelihood calculation for a given tree is $O(p \times s^{2} \times n)$,
where p is the number of unique site patterns in the sequence
(typically on the order of $10^{2}$ to $10^{6}$), s is the number of states each
character in the sequence can assume (typically 4 for a nucleotide
model, 20 for an amino-acid model, or 61 for a codon model), and
n is the number of operational taxonomic units (e.g., species and
alleles).

Additionally, the tree space is very large; the number of
unrooted topologies possible for n operational taxonomic units is
given by the double factorial function $(2n - 5)!!$ [15]. Thus, to
explore even a fraction of the tree space, a very large number of
topologies need to be evaluated, and hence a very great number of
likelihood calculations have to be performed. This leads to analyses
that can take days, weeks, or even months to run. Further
compounding the issue, rapid advances in the collection of DNA
sequence data have made the limitation for biological
understanding of these data an increasingly computational problem. For
phylogenetic inferences, the computation bottleneck is most often the
calculation of the likelihoods on a tree. Hence, speeding up the
calculation of the likelihood function is key to increasing the
performance of these analyses.


2.1.2 Parallel Computation


Advances in computer hardware, specifically in parallel
architectures, such as many-core GPUs, multicore CPUs, and CPU
intrinsics (e.g., SSE and AVX), have created opportunities for new
approaches to computationally intensive methods. The structure
of the likelihood calculation, involving large numbers of positions
and multiple states, as well as other characteristics, makes it a very
appealing computational fit to these modern parallel processors,
especially to GPUs.

BEAGLE exploits GPUs via fine-grained parallelization of 
functions necessary for computing the likelihood on a (phyloge- 
netic) tree. Phylogenetic inference programs typically explore tree 
space in a sequential manner (Fig. 1, tree space) or with only a small 
number of sampling chains, offering limited opportunity for task- 
level parallelization. In contrast, the crucial computation of partial 
likelihood arrays at each node of a proposed tree presents an excel- 
lent opportunity for fine-grained data parallelism, which GPUs are 
especially suited for. The use of many lightweight execution threads 
incurs very low overhead on GPUs, enabling efficient parallelism at 
this level. 
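
As a rough illustration of this one-thread-per-entry decomposition, the sketch below maps each entry of a partial likelihood array to a linear thread index. It assumes the dimensions of the Fig. 1 example (a nucleotide model with 4 states, 9 site patterns, and 4 rate categories) and a simple row-major layout; the layout and the function names are assumptions for illustration and do not describe BEAGLE's actual memory organization.

```python
def entry_index(category, pattern, state, n_patterns, n_states):
    """Flatten (category, pattern, state) into one linear entry index."""
    return (category * n_patterns + pattern) * n_states + state

def entry_coords(index, n_patterns, n_states):
    """Inverse of entry_index: recover (category, pattern, state)."""
    state = index % n_states
    pattern = (index // n_states) % n_patterns
    category = index // (n_states * n_patterns)
    return category, pattern, state

# Dimensions assumed from the Fig. 1 example.
n_categories, n_patterns, n_states = 4, 9, 4

# One lightweight thread per array entry.
n_threads = n_categories * n_patterns * n_states
```

With these dimensions, `n_threads` is 4 x 9 x 4 = 144, and each thread can recover which rate category, site pattern, and state it is responsible for from its own index alone.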

In order to calculate the overall likelihood of a proposed tree, 
phylogenetic inference programs perform a post-order traversal, 
evaluating a partial likelihood array at each node. When using 
BEAGLE, the evaluation of these multidimensional arrays is offloaded
to the library. While each partial likelihood array is still
evaluated in sequence, BEAGLE assigns the calculation of the array
entries to separate GPU threads, for computation in parallel (Fig. 1,
partial likelihood). Further, BEAGLE uses GPUs to parallelize
other functions necessary for computing the overall tree likelihood,
thus minimizing data transfers between the CPU and GPU. These
additional functions include those necessary for computing branch
transition probabilities, for integrating root and edge likelihoods,
and for summing site likelihoods.

Fig. 1 Diagrammatic example of the tree sampling process and fine-grained parallel computation of
phylogenetic partial likelihoods using BEAGLE for a nucleotide model problem with five taxa, nine site
patterns, and four evolutionary rate categories. Each entry in a partial likelihood array L is assigned to a
separate GPU thread t. In this example, 144 GPU threads are created to enable parallel evaluation of each
entry of the partial likelihood array L(x)

Multicore CPU parallelization through BEAGLE can only be 
done via multiple instances of the library, such that each instance 
computes a different data partition. Multiple CPU threads can be 
used (e.g., one for each partition) if the application program 
(BEAST, for the remainder of this chapter) creates the BEAGLE 
instances in separate computation threads, which will be the case 
when using BEAST. This approach suits the trend of increasingly 
large molecular sequence data sets, which are often heavily parti- 
tioned in order to better model the underlying evolutionary pro- 
cesses. BEAGLE itself does not employ any kind of load balancing 
nor are the site columns computed in individual threads. Each 
BEAGLE instance only parallelizes computation on CPUs via SSE 
vectorization. 
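
The partition-per-thread arrangement described above can be sketched as follows. This is only an illustration under stated assumptions: a stand-in function (`partition_log_likelihood`) takes the place of a BEAGLE instance, per-pattern likelihoods are given directly, and all names are hypothetical. Because sites are treated as independent, the total log-likelihood is the sum of the partition log-likelihoods, so each partition can be evaluated in its own thread.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def split_patterns(values, n_parts):
    """Split per-pattern values into n_parts contiguous, near-equal chunks."""
    size, extra = divmod(len(values), n_parts)
    chunks, start = [], 0
    for k in range(n_parts):
        end = start + size + (1 if k < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks

def partition_log_likelihood(chunk):
    """Stand-in for the work done by one library instance on one partition."""
    return sum(math.log(x) for x in chunk)

def total_log_likelihood(values, n_parts):
    """Evaluate each partition in its own thread and sum the results."""
    chunks = split_patterns(values, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        return sum(pool.map(partition_log_likelihood, chunks))

# Hypothetical per-pattern likelihoods for ten site patterns.
example = [0.1 * (k + 1) for k in range(10)]
parallel = total_log_likelihood(example, 4)
serial = partition_log_likelihood(example)
```

Note that pure-Python threads do not run CPU-bound code concurrently in CPython, so this sketch shows only the decomposition, not the speedup obtained with native library instances.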

BEAGLE can also use GPUs to perform partitioned analyses;
however, for problem sizes that are insufficiently large to saturate
the capacity of one device, efficient computation requires multiple 
GPUs. Recent progress has been made in parallelizing the compu- 
tation of multiple data subsets on one GPU [1], and future releases 
of BEAGLE will include this capability. 
Fig. 2 Layer diagram depicting BEAGLE library organization, and illustration of API use. Arrows indicate
direction and relative size of data transfers between the client program and library


2.2 Design


2.2.1 Library


The general structure of the BEAGLE library can be conceptualized
as layers (Fig. 2, library), the uppermost of which is the
application programming interface. Underlying this API is an
implementation management layer, which loads the available
implementations, makes them available to the client program, and
passes API commands to the selected implementation.

The design of BEAGLE allows for new implementations to be
developed without the need to alter the core library code or how
client programs interface with the library. This architecture also
includes a plugin system, which allows implementation-specific
code (via shared libraries) to be loaded at runtime when the
required dependencies are present. Consequently, new frameworks
and hardware platforms can more easily be made available to programs
that use the library, and ultimately to users performing
phylogenetic analyses.

Currently, the implementations in BEAGLE derive from two
general models. One is a serial CPU implementation model, which
does not directly use external frameworks. Under this model, there
is a standard CPU implementation, and one with added SSE intrinsics,
which uses vector processing extensions present in many CPUs
to parallelize computation across character state values. The other
implementation model involves an explicit parallel accelerator
programming model, which uses the CUDA external computing
framework to exploit NVIDIA GPUs. It implements fine-grained
parallelism for evaluating likelihoods under arbitrary molecular
evolutionary models, thus harnessing the large number of
processing cores to efficiently perform calculations [3, 32].

Recent progress has been made in developing new implementations
for BEAGLE, beyond those described here, thus expanding
the range of hardware that can be used. Upcoming releases of the
library will include additional support for CPU parallelism via a
multi-threaded implementation and will support the OpenCL standard,
enabling the use of AMD GPUs [2].


2.2.2 Application Programming Interface


The BEAGLE API was designed to increase performance via fine-scale
parallelization while reducing data transfer and memory copy
overhead to an external hardware accelerator device (e.g., GPU).
Client programs, such as BEAST [11], use the API to offload the
evaluation of tree likelihoods to the BEAGLE library (Fig. 2, API).
API functions can be subdivided into two categories: those which
are only executed once per inference run and those which are
repeatedly called as part of the iterative sampling process. As part
of the one-time initialization process, client programs use the API
to indicate analysis parameters such as tree size and sequence
length, as well as specifying the type of evolutionary model
and hardware resource(s) to be used. This allows BEAGLE to
allocate the appropriate number and size of data buffers on device
memory. Additionally, at this initialization stage, the sequence data
is specified and transferred to device memory. This costly memory
operation is only performed once, thus minimizing its impact.

During the iterative tree sampling procedure, client programs
use the API to specify changes to the evolutionary model and
instruct a series of partial likelihood operations that traverse the
proposed tree in order to find its overall likelihood. BEAGLE
efficiently computes these operations and makes the overall tree
likelihood as well as per-site likelihoods available via another
API call.


2.3 Performance


Peak performance with BEAGLE is achieved when using a high- 
end GPU; however, the relative gain over using a CPU depends on 
model type and problem size as more demanding analyses allow for 
better utilization of GPU cores. Figure 3 shows speedups relative to 
serial CPU code when using BEAGLE with an NVIDIA P100 
GPU for the critical partial likelihood function, with increasing 
unique site pattern counts and for two model types. Computing 
these likelihoods typically accounts for over 90% of the total execu- 
tion time for phylogenetic inference programs and the relationship 
between speedups and problem size observed here primarily 
matches what would be observed for a full analysis. 

Figure 3 includes performance results for computing partial
likelihoods under both nucleotide and codon models. The vertical
axis labels show the speedup relative to the average performance of
a baseline serial (single-threaded and non-vectorized) CPU
implementation. This nonparallel CPU implementation provides a
consistent performance level across different problem sizes and
provides a relevant point of comparison as most phylogenetic inference
software packages use serial code as their standard.

Fig. 3 Plots showing BEAGLE partial likelihood computation performance on the GPU relative to serial CPU
code, under nucleotide and codon models and for an increasing number of unique site patterns. Speedup
factors are on a log-scale

Using a nucleotide model, relative GPU performance over the 
CPU strongly scales with the number of site patterns. For very 
small numbers of patterns, the GPU exhibits poor performance 
due to greater execution overhead relative to overall problem size. 
GPU performance improves quickly as the number of unique site 
patterns is increased and by 10,000 patterns it is closer to a satura- 
tion point, continuing to increase but more slowly. At 100,000 
nucleotide patterns, the GPU is approximately 64 times faster than 
the serial CPU implementation. 

For codon-based models, GPU performance is less sensitive to 
the number of unique site patterns. This is due to the better 
parallelization opportunity afforded by the 61 biologically mean- 
ingful states that can be encoded by a codon. The higher state 
count of codon data compared to nucleotide data increases the 
ratio of computation to data transfer, resulting in increased GPU 
performance for codon-based analyses. For a problem size with 
10,000 codon patterns, the GPU is over 256 times faster than the 
serial CPU implementation. 


2.4 Memory Usage


When assessing the suitability of a phylogenetic analysis for GPU
acceleration via BEAGLE, it is also important to consider if the 
GPU has sufficient on-board memory for the analysis to be per- 
formed. GPUs typically have less memory than what is available to 
CPUs, and the high transfer cost of moving data from CPU to 
GPU memory prevents direct use of CPU memory for GPU 
acceleration. 
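
A back-of-the-envelope estimate can help with this assessment. The sketch below counts only the partial likelihood buffers, assuming one buffer per internal node of a rooted binary tree and 8 bytes per double-precision value. These are simplifying assumptions rather than a description of how BEAGLE or BEAST actually allocate memory; real analyses also need transition matrices, scaling buffers, and possibly duplicate buffers, so the result should be read as a lower bound. The function name and its parameters are hypothetical.

```python
def partials_memory_bytes(n_taxa, n_patterns, n_states, n_categories,
                          bytes_per_value=8, buffers_per_node=1):
    """Lower-bound estimate of memory used by partial likelihood arrays.

    Assumes a rooted binary tree with n_taxa - 1 internal nodes, each
    holding buffers_per_node arrays of n_patterns * n_states * n_categories
    values (8 bytes each in double precision).
    """
    n_internal = n_taxa - 1
    entries = n_patterns * n_states * n_categories
    return n_internal * buffers_per_node * entries * bytes_per_value

def to_gib(n_bytes):
    """Convert a byte count to gibibytes."""
    return n_bytes / 2**30

# Toy nucleotide problem: 5 taxa, 9 patterns, 4 states, 4 rate categories.
toy = partials_memory_bytes(5, 9, 4, 4)

# Same tree and alignment size under nucleotide and codon state spaces.
nucleotide = partials_memory_bytes(100, 1000, 4, 4)
codon = partials_memory_bytes(100, 1000, 61, 4)
```

Under these assumptions the estimate grows linearly in each dimension, so moving from a 4-state to a 61-state model multiplies the partials memory by 61/4 for the same tree and pattern count.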

Figure 4 shows how much memory is required for problems of
different sizes when running nucleotide and codon-model analyses
in BEAST with BEAGLE GPU acceleration. Note that when multiple
GPUs are available, BEAST can partition a data set into
separate BEAGLE instances, one for each GPU. Thus, each GPU
will only require as much memory as necessary for the data subset
assigned to it. Typical PC-gaming GPUs have 8 GB of memory or
less, while GPUs dedicated to high-performance computing, such
as the NVIDIA Tesla series, may have as much as 24 GB of memory.

Fig. 4 Contour plots depicting BEAGLE memory usage on GPUs for BEAST nucleotide and codon-model
analyses with 4 rate categories in double-precision floating-point format, for a range of problem sizes with
different numbers of taxa and of unique site patterns. Memory requirements shown here assume an
unpartitioned dataset. Partitioned analyses and more sophisticated models that use multiple BEAGLE
instances incur memory overhead per additional library instance. Black dots indicate memory usage
requirements for the unpartitioned version of three data sets subsequently described in this chapter


2.5 Hardware


Highly parallel computing technologies such as GPUs have over- 
taken traditional CPUs in peak performance potential and continue 
to advance at a faster pace. Additionally, the memory bandwidth 
available to the processor is especially relevant to data-intensive 
computations, such as the evaluation of nucleotide model likeli- 
hoods. In this measure as well, high-end GPUs significantly out- 
perform equivalently positioned CPUs. 

BEAGLE was designed to take advantage of this trend of 
increasingly advanced GPUs and uses runtime compilation meth- 
ods to optimize code for whichever generation of hardware is being 
used. Table 1 lists hardware specifications for the processors used in 
this chapter. We note that further advancements in the GPU market
for scientific computing are on their way, with NVIDIA preparing the
launch (at the time of writing) of the Tesla V100 in Q3 of 2017.
The new NVIDIA Tesla V100 features a total of 5120 CUDA cores
and comes equipped with 32 GB of on-board memory with
900 GB/s of bandwidth. As such, it seems to have the potential
to reach 7.5 TFLOPS in double-precision peak performance
(DP PP), a roughly 50% increase over NVIDIA's current flagship, the
Tesla P100.
Table 1
Hardware specifications for the Intel CPUs and NVIDIA GPUs used in this chapter

Hardware         Year   Cores     Memory       Bandwidth   DP PP
Xeon E5-2680v2   2013   2 x 10    64 GB        60 GB/s     0.45 TFLOPS
Xeon E5-2680v3   2014   2 x 12    64 GB        68 GB/s     0.96 TFLOPS
GTX 590          2011   2 x 512   2 x 1.5 GB   164 GB/s    0.31 TFLOPS
Tesla K20X       2012   2688      6 GB         250 GB/s    1.31 TFLOPS
Tesla K40        2013   2880      12 GB        288 GB/s    1.43 TFLOPS
Quadro P5000     2016   2560      16 GB        288 GB/s    NA
Tesla P100       2016   3584      16 GB        720 GB/s    4.70 TFLOPS

Estimated performance in double-precision peak performance (DP PP) taken from the manufacturer's website. Note that
the Quadro P5000 GPU only lists performance in single precision and we hence list its double-precision performance as
not available (NA)


3 Results 


In this section, we compare the performance of various typical 
Bayesian phylogenetic, phylogenomic, and phylodynamic analyses 
on different multicore architectures. In Subheading 3.1, we analyze 
a data set of mitochondrial genomes [32] using a high-dimensional 
model of codon substitution which, albeit low in number of para- 
meters, is particularly challenging in phylogenetic analyses specifi- 
cally because of the high-dimensional state space. In Subheading 
3.2, we analyze the largest Ebola virus data set at the time of 
publication [12] using a collection of nucleotide substitution mod- 
els, i.e., one per codon position and an extra one for analyzing the 
intergenic regions, where the large number of taxa and unique site 
patterns offer an interesting test case for the comparison between 
CPU and GPU performance. Finally, Subheading 3.3 reports on 
the performance of analyzing data sets that complement sequence 
data with discrete trait data (typically host data or geographic data), 
for which transition rates between (a potentially large number of)
discrete trait states are parameterized as a generalized linear model 
(GLM). All performance evaluations in this results section were run 
for 100,000 iterations (which is usually insufficient to achieve 
convergence) in BEAST v1.8.4 [11], using double precision 
(both on CPU and GPU) and in conjunction with BEAGLE 
v2.1.2 [3]. By default, BEAST—through BEAGLE—uses SSE2 
(Streaming SIMD Extensions 2), an SIMD instruction set exten- 
sion to the x86 architecture, when performing calculations 
on CPU. 
3.1 Carnivores


Selection is a key evolutionary process in shaping genetic diversity 
and a major focus of phylogenomics investigations [23]. Research- 
ers frequently evaluate the strength of selection operating on genes 
or even individual codons in the entire phylogeny or in a subset of 
branches using statistical methods. Codon substitution models 
have been particularly useful for this purpose because they allow 
estimating the ratio of non-synonymous to synonymous substitution
rates (dN/dS) in a phylogenetic framework. Goldman and
Yang [18] and Muse and Gaut [26] developed the first codon- 
based evolutionary models (GY and MG, respectively), i.e., models 
that have codons as their states, incorporating biologically mean- 
ingful parameters such as transition/transversion bias, variability of 
a gene, and amino acid differences. 

Full codon substitution models are computationally expensive 
compared to standard nucleotide substitution models due to their 
large state space. Compared to nucleotide models (4 states) and 
amino acid models (20 states), a full vertebrate mitochondrial 
codon model has 60 states (ignoring the four nonsense or stop 
codons). We restrict ourselves to the standard GY codon substitu- 
tion model implementation in BEAST [11], employ the standard 
assumption that mutations occur independently at the three codon 
positions and therefore only consider substitutions that involve a 
single-nucleotide substitution, and assume that codons evolve 
independently from one another. Additionally, we allow for substi- 
tution rate heterogeneity among codons using a discrete gamma 
distribution (i.e., each codon is allowed to evolve at a different 
substitution rate) [33], which increases the computational 
demands of such an analysis fourfold (given that we allow for the 
standard assumption of four discrete rate categories). 
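
The size and sparsity of this state space can be illustrated with a short sketch. It enumerates the 60 sense codons of the vertebrate mitochondrial genetic code (stop codons TAA, TAG, AGA, and AGG) and lists the ordered codon pairs that differ at exactly one nucleotide position, which are the only substitutions given a nonzero instantaneous rate under the single-nucleotide assumption. The function names are illustrative and do not correspond to BEAST or BEAGLE code.

```python
from itertools import product

NUCLEOTIDES = "TCAG"

# Stop codons under the vertebrate mitochondrial genetic code.
VERT_MITO_STOPS = {"TAA", "TAG", "AGA", "AGG"}

def sense_codons(stops=VERT_MITO_STOPS):
    """All nucleotide triplets that are not stop codons."""
    triplets = ("".join(c) for c in product(NUCLEOTIDES, repeat=3))
    return [codon for codon in triplets if codon not in stops]

def single_step_pairs(codons):
    """Ordered codon pairs that differ at exactly one position."""
    return [(a, b) for a in codons for b in codons
            if sum(x != y for x, y in zip(a, b)) == 1]

states = sense_codons()
allowed = single_step_pairs(states)

# Fraction of off-diagonal rate matrix entries that can be nonzero.
n_off_diagonal = len(states) * (len(states) - 1)
density = len(allowed) / n_off_diagonal
```

Although most entries of the 60 x 60 instantaneous rate matrix are zero under this assumption, the transition probability matrix over a branch is generally dense, so the partial likelihood calculation still sums over all 60 states.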

As a first application of using state-of-the-art hardware in sta- 
tistical phylogenetics, we reevaluate the performance of a full codon 
model on a set of mitochondrial genomes from extant carnivores 
and a pangolin outgroup [4, 32]. This genomic sequence align- 
ment contains 10,869 nt columns that code for 12 mitochondrial 
proteins and when translated into a single 60-state vertebrate mito- 
chondrial codon model, yields a total of 3623 alignment columns, 
of which 3601 site patterns are unique [32]. Figure 5 shows a 
comparison of the computational throughput between various 
CPU and GPU computing platforms. To this end, we make use 
of an option in BEAST [11] to split an alignment into two or more 
pieces of equal length, with each resulting alignment being evalu- 
ated on a separate processor core or computing device for optimal 
performance. Figure 5 shows that the analysis scales remarkably
well on CPU, where the use of each additional processor core
results in a performance increase. This can be attributed to the
use of full codon models, which impose a higher workload when
evaluating each likelihood and hence a higher ratio of concurrent
evaluation to thread communication. As the evaluation of the total
number of unique site patterns is split over more processor cores,
the workload per core decreases and the communication overhead
increases, resulting in smaller relative performance increases.

Fig. 5 Performance evaluation of different CPU and GPU configurations when estimating a single full codon
substitution model on a full genome mitochondrial data set. Left: performance comparison on multicore CPUs
(with the dashed line indicating optimal GPU performance). Right: performance comparison on GPUs. The
numbers below the bars indicate the number of partitions/threads used in each analysis. The GTX 590, a GPU
released in 2011 for PC gaming, performs as well as a 24-core CPU configuration. However, GPUs from
the Tesla generation, aimed at scientific computing, drastically outperform the CPU system, with the recently
released P100 showcasing the impressive improvements in the GPU market, running 166 times faster than a
single-core CPU setup

The performance of a 24-core CPU setup is easily matched by a 
single GPU (the GTX 590) that was originally aimed at the market 
of PC gaming. However, subsequent improvements in GPU cards 
for scientific computing have yielded impressive performance gains, 
with a single Tesla K20 GPU outperforming 2 GTX 590 GPUs. 
Whereas the advent of the Tesla K40 offered further performance 
increases, it was mainly welcomed for having twice the amount of 
on-board memory, allowing for much larger data sets to be ana- 
lyzed on GPU. The recent introduction of the Tesla P100 GPU 
promised and delivered astonishing results, as shown in Fig. 5, with 
a single Tesla P100 GPU delivering six times the performance of a 
Tesla K40 GPU on these high-dimensional full codon models. We 
conclude that the use of a high-performance computational library 
such as BEAGLE, in combination with a powerful GPU, has signif- 
icantly facilitated the evaluation of and phylogenetic inference with 
full codon models. 
3.2 Ebola Virus


The original developments within the BEAGLE library offered
considerable computational speedup when evaluating codon models
(up to a 52-fold increase when employing three GPU cards)
and nucleotide models (up to a 15-fold increase when using three
GPU cards) in double precision [32]. This may have resulted in
the perception that GPU cards are mainly useful when evaluating
codon models, but that the benefit for fitting nucleotide models was
not sufficiently substantial to warrant GPUs. To offer an objective
assessment of the usefulness of GPUs in such cases, we analyze
the use of various CPU and GPU configurations on a full genome
Ebola virus data set, consisting of 1610 publicly available genomes
sampled over the course of the 2013-2016 Ebola virus disease
epidemic in West Africa [12] (we discuss this study in more detail
in Subheading 4). This data set, encompassing 18,992 nt columns, is
modelled with four partitions: one for each codon position and one
additional partition for the intergenic region (which consists of
several noncoding regions interspersed in the genome). The three
codon partitions contain, respectively, 2366, 2328, and 2731
unique site patterns, while the intergenic partition contains 2785
unique site patterns. We model among-site rate variation [33] in
each partition independently, which confronts us with a
computationally demanding analysis for this large number of taxa and
unique site patterns.

Figure 6 shows how the performance of such a large nucleotide
data set scales with the available CPU and GPU resources. Contrary
to the carnivores data set analysis in Subheading 3.1, this analysis
does not scale particularly well with the number of CPU cores
available, as the main benefit lies with splitting each partition into
two subpartitions and only limited performance gains can be
observed when using additional partitions or threads. Popular
single GPU cards for scientific computing, such as the Tesla
K40, match the optimal performance brought about by using
16 CPU cores, and may provide a useful alternative to multicore
CPU systems. However, the decreasing cost for increasingly parallel
multicore CPU systems makes this a difficult matchup for slightly
older GPUs. More recently introduced GPU cards, such as the
Tesla P100, are able to deliver a substantial performance improvement
over a multicore CPU setup, with two Tesla P100 GPUs
running in parallel offering over twice the performance of a
16-core CPU setup. We note that the GTX 590 cards, as well as a
single Tesla K20 card, do not contain sufficient on-board memory
to hold the full data set and as such, these benchmarks could not be
run on those resources.


3.3 Phylogeography


As shown in the results in Subheadings 3.1 and 3.2, different
partitions of the aligned sequence data can contain a large number
of unique site patterns, rendering phylogenomic inference challenging.
However, other data types are also included more
frequently, such as trait data to be analyzed alongside the sequence
data and hence potentially influencing the outcome of such an
analysis (for an overview, see, e.g., Baele et al. [6]).

Fig. 6 Performance evaluation of different CPU and GPU configurations when estimating a four-partition
nucleotide substitution model on a full genome Ebola virus data set. Left: performance comparison on
multicore CPUs (with the dashed line indicating optimal GPU performance). Right: performance comparison
on GPUs. The numbers below the bars indicate the number of partitions/threads used in each analysis.
Because of the lower dimensionality compared to full codon models, nucleotide substitution models currently
take less advantage of the large number of cores present on GPUs. The fastest GPU configuration,
consisting of two Tesla P100 GPUs, outperforms by 107% the fastest CPU configuration that employs
16 threads on a 24-core Haswell CPU

Arguably, the most frequently considered traits in phylody- 
namics, and molecular evolution in general, are spatial locations. 
The interest in spatial dispersal has developed into its own research 
field referred to as phylogeography, with Bayesian inference of 
discrete phylogenetic diffusion processes being adopted in the 
field of biogeography [30]. Jointly estimating the phylogeny and 
the trait evolutionary process, Lemey et al. [23] implemented a 
similar Bayesian full probabilistic connection between sequences 
and traits in BEAST [11], with applications focusing on spatiotem- 
poral reconstructions of viral spread. These approaches offer exten- 
sive modelling flexibility at the expense of a quadratic growth in 
number of instantaneous rate parameters in the continuous-time 
Markov chain (CTMC) model as a function of the state dimension- 
ality of the trait. This can be seen in Fig. 7, which shows two maps 
with different numbers of (discrete) locations and the 
corresponding CTMC models that describe the instantaneous 
rates of transition between these locations. 
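
The quadratic growth in the number of rate parameters can be made concrete with a small sketch. It assumes a general (non-reversible) CTMC on K discrete states, in which every ordered pair of distinct states has its own rate and each diagonal entry is set so that its row sums to zero. The function names are illustrative only.

```python
def free_rate_count(n_states):
    """Number of off-diagonal rates in an n_states x n_states generator."""
    return n_states * (n_states - 1)

def generator_from_rates(n_states, off_diagonal):
    """Build a CTMC generator from off-diagonal rates given row by row.

    The diagonal is not a free parameter: each diagonal entry is the
    negative sum of the other entries in its row.
    """
    if len(off_diagonal) != free_rate_count(n_states):
        raise ValueError("expected one rate per ordered pair of distinct states")
    rates = iter(off_diagonal)
    q = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        for j in range(n_states):
            if i != j:
                q[i][j] = next(rates)
        q[i][i] = -sum(q[i])
    return q

# Three locations need 6 rates; ten locations need 90.
counts = {k: free_rate_count(k) for k in (3, 10)}
q3 = generator_from_rates(3, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```

Going from 3 to 10 locations increases the state count by a factor of about 3.3 but the number of rates by a factor of 15, which is why high-dimensional discrete trait models become demanding.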
Fig. 7 Graphical depiction of the Benelux countries (top) and the provinces of Belgium (bottom). When these
countries and provinces are used as discrete location states in a discrete trait model, this yields, respectively,
a 3 x 3 and a 10 x 10 CTMC model with instantaneous rates of transition between each pair of locations.
Such models are subject to the same restrictions as those used in popular substitution models, i.e., the rows
sum to 0. As such, these CTMC models consist of, respectively, 6 and 90 free parameters to be estimated


Many phylodynamic hypotheses can be addressed through the
combination of genetic and trait data, but additional data in the
form of covariates can help explain the evolutionary or epidemiological
process. Such covariates can be used in a GLM formulation
on a matrix of transition rate parameters between locations defining
a CTMC process. Lemey et al. [25] developed an approach to
simultaneously reconstruct spatiotemporal history and identify
which combination of covariates associates with the pattern of
spatial spread. This approach involves parameterizing each rate of
among-location movement, typically denoted as the $ij$th elements
($\Lambda_{ij}$) of the CTMC transition rate matrix, in the phylogeographic
model as a log linear function of various potential covariates:

$$\log \Lambda_{ij} = \beta_1 \delta_1 x_{i,j,1} + \beta_2 \delta_2 x_{i,j,2} + \cdots + \beta_N \delta_N x_{i,j,N}, \qquad (1)$$

where $\beta_i$ is the estimated effect size of covariate $x_i$, $\delta_i$ is a binary
indicator that tracks the posterior probability of the inclusion of
covariate $x_i$ in the model, and $N$ equals the number of covariates;
further, in the case of Fig. 7: $i, j \in \{A, B, C\}$ with $N = 3$ (top), and
$i, j \in \{A, B, C, D, E, F, G, H, I, J\}$ with $N = 10$ (bottom). Priors
and posteriors for the inclusion probabilities ($\delta$) can be used to
express the support for each predictor in terms of Bayes factors (for
more information, see Baele et al. [6], Lemey et al. [25]). We discuss
examples of such possible predictors in a phylogeographic setting in
Subheading 4 but focus here on performance benchmarks for such
generalized linear models.


3.3.1 Bat Rabies


We here assess the performance of a phylodynamic setup aimed at 
reconstructing the spatial dispersal and cross-species dynamics of 
rabies virus (RABV) in North American bat populations based on a 
set of 372 nucleoprotein gene sequences (nucleotide positions: 
594-1353). The data set comprises a total of 17 bat species sam- 
pled between 1997 and 2006 across 14 states in the USA [31]. Two 
additional species that had been excluded from the original analysis 
owing to a limited amount of available sequences, Myotis austror- 
iparius (Ma) and Parastrellus hesperus (Ph), are also included here 
[14]. We also include a viral sequence with an unknown sampling 
date (accession no. TX5275, sampled in Texas from Lasiurus bor- 
ealis), which will be adequately accommodated in our inference. 
This leads to a total of 548 unique site patterns. Following Faria 
et al. [14], we employ two GLM-diffusion models for this analysis, 
one on the discrete set of 17 bat species and another on the discrete 
set of 14 location states. 
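
As a hedged illustration of how a GLM-diffusion model turns covariates into transition rates, the sketch below evaluates the log-linear parameterization of Eq. 1 for every ordered pair of discrete states. It uses an index k for the covariates to keep it distinct from the state indices i and j; the covariate values, effect sizes, and function name are made up for the example and are not taken from the bat rabies analysis or from BEAST.

```python
import math

def glm_rate_matrix(covariates, beta, delta):
    """Return a K x K CTMC generator with off-diagonal rates from Eq. 1.

    covariates[k][i][j] holds the value of covariate k for the ordered
    pair of states (i, j); beta[k] is its effect size and delta[k]
    (0 or 1) is its inclusion indicator.
    """
    n_states = len(covariates[0])
    rates = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        for j in range(n_states):
            if i != j:
                log_rate = sum(b * d * x[i][j]
                               for b, d, x in zip(beta, delta, covariates))
                rates[i][j] = math.exp(log_rate)
        # The diagonal is still 0.0 here, so this is the off-diagonal sum.
        rates[i][i] = -sum(rates[i])
    return rates

# Hypothetical example: three states and two covariates.
x1 = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
x2 = [[0, 5, 5], [5, 0, 5], [5, 5, 0]]
beta = [math.log(2.0), -1.0]
delta = [1, 0]  # the second covariate is excluded from the model

rates = glm_rate_matrix([x1, x2], beta, delta)
```

With the second indicator set to 0, only the first covariate contributes, so the rate from state 0 to state 1 is exp(log 2) = 2 and the rate from state 0 to state 2 is exp(2 log 2) = 4. Setting all indicators to 0 would give every off-diagonal rate the value 1.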

Figure 8 shows the performance of various multicore platforms 
on the bat rabies Bayesian phylodynamic analysis. In contrast to 
previous examples, the low number of sites (and hence unique site 
patterns) in the alignment does not offer many options for splitting 
the observed data likelihood over additional threads. While four 
CPU cores offer the optimal performance across our CPU plat- 
forms, using more threads for the analysis causes serious communi- 
cation overhead, slowing down the analysis. Comparing the CPU 
results with the GPU results shows that, across all multicore plat- 
forms tested, a 4-core CPU offers the best performance. 

Nonetheless, this scenario provides a very interesting use case
for employing multiple graphics cards for scientific computing.
Even though the (relatively small) dimensions of this particular
example do not allow for a performance increase, it will be beneficial
for higher-dimensional cases to compute each diffusion model
on a separate GPU. When assuming independent diffusion processes
that only depend on the underlying phylogeny, each of the
trait diffusion models can be computed on a different GPU,
whereas the data alignment can be split into two subpartitions of
equal complexity (i.e., with an equal number of unique site patterns)
and hence also be computed in parallel over the two GPUs.
However, the limited sequence data size and the relatively restricted
number of discrete locations make this data set less suited for
illustrating performance increases using GPUs.

Fig. 8 Performance evaluation of different CPU and GPU configurations when estimating a single-partition
nucleotide substitution model on a rabies virus data set, along with two GLMs. Left: performance comparison
on multicore CPUs (with the dashed line indicating optimal GPU performance). Right: performance comparison
on GPUs. Given the short length of the alignment, communication overhead becomes quite severe when
adding a large number of CPU processor cores. Even two state-of-the-art Tesla P100 GPUs fail to outperform
four CPU cores, given the low number of sequences and the low number of states for the GLMs


3.3.2 Ebola Virus


We here assess the performance of a similar setup as in the previous 
section, but using the data from the 2013-2016 West African 
epidemic caused by the Ebola virus [12]. We hence use the nucleo- 
tide data from the previous Ebola example (see Subheading 3.2) and 
augment it with location states. Using a phylogeographic GLM 
that integrates covariates of spatial spread, we have examined 
which features influenced the spread of EBOV among administra- 
tive regions at the district (Sierra Leone), prefecture (Guinea), and 
country (Liberia) levels. This resulted in a GLM parameterizing 
transition rates among 56 discrete location states according to 
25 potential covariates (see Dudas et al. [12] for more information), 
resulting in a computationally challenging analysis. 
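
The GLM parameterization of the transition rates can be written compactly. The following is a sketch of the log-linear form commonly used for phylogeographic GLMs; the symbols are our own notation for illustration and do not appear elsewhere in this chapter.

```latex
\log \Lambda_{ij} \;=\; \sum_{k=1}^{K} \beta_k \, \delta_k \, x_{ijk},
\qquad \delta_k \in \{0, 1\}
```

Here $\Lambda_{ij}$ is the transition rate from location $i$ to location $j$, $x_{ijk}$ is the value of covariate $k$ for that pair of locations, $\beta_k$ is its effect size, and $\delta_k$ is an indicator that switches covariate $k$ in or out of the model. For the Ebola analysis, $K = 25$ and $i, j$ range over the 56 location states.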

As shown in Fig. 9, we have evaluated the performance of this 
challenging data set on our different multicore platforms. By com- 
paring these benchmarks with those in Fig. 6, it’s clear that the 
addition of a high-dimensional discrete trait model is much harder 
to process for any multicore CPU configuration. Adding more 
CPU cores to the analysis does not improve performance by 
much, indicative of the discrete trait model being the main
bottleneck in this analysis. This can be attributed to the high
dimension of the discrete phylogeographic model [23] and the
fact that this model describes a single column of characters, which
cannot be split into multiple partitions. Relative to the computational
complexity of modelling the location states, splitting the
observed sequence data over multiple partitions/threads yields
relatively small performance improvements.

Some of the (older) GPUs cannot fit the full data set in memory
(such as the GTX 590 and a single Tesla K20), but those that
can vastly outperform any CPU setup. Further, as these GPUs
are better equipped to handle high-dimensional models, splitting
the observed sequence data over multiple physical cards still yields
noticeable performance gains. In contrast to Fig. 8, where two
discrete phylogeographic models were used that could each be
computed on different GPUs, the fact that this example only considers
a single trait observation explains why smaller performance gains
are obtained by adding an additional GPU to the analysis.

Fig. 9 Performance evaluation of different CPU and GPU configurations when estimating a four-partition
nucleotide substitution model on a full genome Ebola virus data set along with a GLM-diffusion model for the
location traits. Left: performance comparison on multicore CPUs (with the dashed line indicating optimal GPU
performance). Right: performance comparison on GPUs. While a previous comparison of solely the nucleotide
data resulted in a performance increase of over 100% for GPUs over CPUs, we observe a far more pronounced
benefit for GPU in this case with two Tesla P100 GPUs outperforming a 12-core CPU setup by 568%


4 Examples


In this section, we highlight examples of large sequence data sets
that are augmented with trait data in the form of discrete geographic
locations, for which BEAGLE [3] offers impressive
computational benefits, specifically when running these analyses on
powerful graphics cards for scientific computing. Further, discrete
phylogeographic models can be equipped with generalized linear
models to identify predictors of pathogen spread. Both inclusion
probabilities and conditional effect sizes for these predictors are
estimated in order to determine support for such explanatory variables
of (pathogen) spread.


4.1 Human Influenza H3N2


A potentially powerful predictor for the behavior of influenza and 
other infectious diseases comes in the form of information on 
global human movement patterns, of which the worldwide air 
transportation network is by far the best studied system of global 
mobility in the context of human infectious diseases [8].

Lemey et al. [25] use a discrete phylogeographic model 
equipped with a GLM to show that the global dynamics of influ- 
enza H3N2 are driven by air passenger flows, whereas at more local 
scales spread is also determined by processes that correlate with 
geographic distance. For a data set that encompasses 1441 time-stamped
hemagglutinin sequences (sampled between 2002 and
2007) and up to 26 locations to be used in a discrete phylogeographic
model equipped with a GLM, BEAGLE can offer substantial
performance gains. A snapshot of a visual reconstruction
through geographic space is presented in Fig. 10, which includes
a summary of the support for the collection of covariates in the
GLM that offer the strongest contribution to spatial spread among
those tested. As illustrated in Fig. 10 (but see Lemey et al. [25] for
additional data), there is strong evidence that air passenger flow is
among the most dominant drivers of the global dissemination of
H3N2 influenza viruses. Further, geographic spread is found to be
inversely associated (data not shown; but see Lemey et al. [25]) with
geographical distance between locations and with origin and destination
population densities, which may seem counterintuitive. As
the authors state, this negative association of population density
with viral movement may suggest that commuting is less likely, per
capita, to occur out of, or into, dense subpopulations.

Fig. 10 Snapshot of the geographic spread of human influenza subtype H3N2, based on 1441 hemagglutinin
sequences sampled between 2002 and 2007 [25]. A discrete phylogeographic approach was used, allocating
the sequence data into a discrete number of locations and employing a generalized linear model on the
parameters that model geographic spread. Inclusion probabilities and Bayes factor support are shown for the
most prominent predictors of H3N2 geographic spread. D3 visualization is made using SpreaD3 [7], with
circular polygon areas proportional to the number of tree lineages maintaining that location at that time


4.2 Ebola Virus


During the two and a half years Ebola virus (EBOV) circulated in 
West Africa, it caused at least 28,646 cases and 11,323 deaths. As 
mentioned in Subheading 3.3.2, Dudas et al. [12] used 1610 
genome sequences collected throughout the epidemic, represent- 
ing over 5% of recorded Ebola virus disease (EVD) cases to recon- 
struct a detailed phylogenetic history of the movement of EBOV 
within and between the three most affected countries. This study
considers a massive time-stamped data set that makes it possible to uncover
regional patterns and drivers of the epidemic across its entire duration,
whereas individual studies had previously focused on either
limited geographical areas or time periods. The authors use the
phylogeographic GLM to test which features were important in 
shaping the spatial dynamics of EVD during the West African 
epidemic (see Fig. 11). 

The phylogeographic GLM allowed Dudas et al. [12] to deter- 
mine the factors that influenced the spread of EBOV among admin- 
istrative regions at the district (Sierra Leone), prefecture (Guinea), 
and county (Liberia) levels. The authors find that EBOV tends to
disperse between geographically close regions, with great circle 
distances having among the strongest Bayes factor support for 
inclusion in the GLM among all covariates tested (along with four 
other predictors). Additionally, both origin and destination popu- 
lation sizes are equally strongly and positively correlated with viral 
dissemination (see Fig. 11). Dudas et al. [12] conclude that the 
combination of the positive effect of population sizes with the 
inverse effect of geographic distance implies that the epidemic’s 
spread followed a classic gravity-model dynamic, with intense dis- 
persal between larger and closer populations. Finally, the authors 
found a significant propensity for virus dispersal to occur within 
each country, relative to internationally, suggesting that country 
borders may have provided a barrier for the geographic spread 
of EBOV. 
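
The gravity-model reading of these results can be made explicit. As a sketch in our own notation (the symbols $T_{ij}$, $P_i$, $P_j$, $d_{ij}$ and the exponents are assumptions for illustration, not quantities reported by Dudas et al. [12]), a gravity model takes the dispersal intensity from region $i$ to region $j$ to be

```latex
T_{ij} \;\propto\; \frac{P_i^{\alpha} \, P_j^{\gamma}}{d_{ij}^{\kappa}}
\quad\Longrightarrow\quad
\log T_{ij} \;=\; \text{const} + \alpha \log P_i + \gamma \log P_j - \kappa \log d_{ij}
```

On the log scale this is linear in the log-transformed population sizes and distance. Positive coefficients on origin and destination population sizes, combined with a negative coefficient on distance, therefore correspond to $\alpha, \gamma, \kappa > 0$.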
Fig. 11 Snapshot of the geographic spread of the Ebola virus during the 2013-2016 West African epidemic,
based on 1610 whole genome sequences [12]. A discrete phylogeographic approach was used, allocating the 
sequence data into a discrete number of locations and employing a generalized linear model on the 
parameters that model geographic spread. Inclusion probabilities and Bayes factor support are shown for 
the most prominent predictors of Ebola virus geographic spread. D3 visualization is made using SpreaD3 [7], 
with circular polygon areas proportional to the number of tree lineages maintaining that location at that time 


5 Adaptive MCMC 


The various data sets described in this chapter so far have shown the 
use and computational performance of a wide range of models in 
phylogenetic, phylogenomic, and phylodynamic research. Whether 
employing full codon models (e.g., the carnivores data set), codon 
partition models (e.g., the Ebola data set), or discrete phylogeo- 
graphic models, the number of parameters of a typical Bayesian 
phylogenetic analysis has increased drastically over the years. This is 
exacerbated by the use of partitioning strategies, resulting also in a 
potentially large array of likelihoods that need to be evaluated 
simultaneously, increasing run times for most phylogenetic ana- 
lyses. In a similar fashion, computational resources available to 
researchers have also markedly increased, both in the form of multi- 
core CPU technology and increasingly powerful graphics cards 
targeted towards scientific computing. The ubiquitous availability
of multiprocessor and multicore computers has motivated
the design of novel parallel algorithms to make efficient use of
these machines [22, 32].

Many Bayesian phylogenetics software packages, such as
BEAST [11] and MrBayes [29], do not fully exploit the inherent
parallelism of such multicore systems when analyzing partitioned
data because they typically update one single parameter at a time
(a practice called single-component Metropolis-Hastings; Gilks
et al. [17]). Such a single parameter often belongs to an evolutionary
model for a single data partition, so that only one of the
potentially large collection of (observed) data likelihoods is
modified at any one time. Such a strategy does not use the computational
power of modern-day multiprocessor and multicore systems
to its full advantage. Updating all the models’ parameters at
once, however, leads to multiple data likelihoods being modified
simultaneously, thereby making better use of the resources offered
by these multicore systems (see Fig. 12).

Fig. 12 Conceptual visualization of the potential benefits of an adaptive MCMC algorithm over single-component
Metropolis-Hastings (green bars indicate that a processor is computing a specific likelihood). In
Bayesian phylogenetics, the common practice of updating a single parameter (i.e., either a, b, c, or d) at a time
leaves many CPU cores idle, underusing the computational performance of such architecture. Adaptive MCMC
allows a collection of continuous parameters to be updated simultaneously (i.e., a, b, c, and d), putting many cores
(in this case: 4) to work in a parallel fashion. Quad-Core AMD Opteron processor silicon die is shown, courtesy
of Advanced Micro Devices, Inc. (AMD), obtained from Wikimedia Commons

In recent work, Baele et al. [5] propose to use multivariate
components to update blocks of parameters, leading to acceptance 
or rejection for all of those parameters simultaneously, rather than 
updating all the parameters one by one in a sequential fashion using 
low-dimensional or scalar components [17]. To this end, the 
authors developed an adaptable variance multivariate normal 
(AVMVN) transition kernel for use in Bayesian phylogenetics, 
based on the work of Roberts and Rosenthal [28], to simulta- 
neously estimate a large number of partition-specific parameters. 
Baele et al. [5] implemented this adaptive MCMC approach in the 
popular open-source BEAST software package [11], which enables 
this transition kernel to exploit the computational routines within 
the BEAGLE library [3]. The authors applied this transition kernel 
to a collection of clock model parameters, speciation model para- 
meters, coalescent model parameters, and partition-specific evolu- 
tionary model parameters (which include substitution model 
parameters, varying rates across sites parameters, and relative rate 
parameters), although this kernel may find its use on parameters in 
many additional models. 
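
To make the idea of a block update with an adaptively tuned multivariate normal proposal concrete, the following is a minimal sketch of a generic adaptive Metropolis sampler in the style of Roberts and Rosenthal. It is not the AVMVN implementation in BEAST: the function names, the two-component mixture with weight `beta`, the scaling constants, the diagonal jitter, and the toy bivariate normal target are all assumptions made for illustration.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a small symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def adaptive_mvn_mcmc(log_target, start, n_iter, beta=0.05, seed=1):
    """Update all parameters jointly; the proposal covariance is learned from the chain."""
    rng = random.Random(seed)
    d = len(start)
    x, lp = list(start), log_target(start)
    count, mean = 1, list(start)            # running mean of visited states
    m2 = [[0.0] * d for _ in range(d)]      # running sum of outer products (Welford)
    samples, accepted = [], 0
    for n in range(1, n_iter + 1):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        if n <= 2 * d or rng.random() < beta:
            # Fixed small-variance component keeps the kernel well defined.
            step = [0.1 / math.sqrt(d) * zi for zi in z]
        else:
            # Empirical covariance of the chain so far, with a small diagonal jitter.
            cov = [[m2[i][j] / (count - 1) + (1e-8 if i == j else 0.0)
                    for j in range(d)] for i in range(d)]
            chol = cholesky(cov)
            scale = 2.38 / math.sqrt(d)
            step = [scale * sum(chol[i][k] * z[k] for k in range(i + 1))
                    for i in range(d)]
        proposal = [xi + si for xi, si in zip(x, step)]
        lp_new = log_target(proposal)
        # Symmetric proposal: accept or reject the whole block at once.
        if math.log(1.0 - rng.random()) < lp_new - lp:
            x, lp = proposal, lp_new
            accepted += 1
        samples.append(list(x))
        # Update the running mean and covariance with the current state.
        count += 1
        delta = [xi - mi for xi, mi in zip(x, mean)]
        mean = [mi + di / count for mi, di in zip(mean, delta)]
        for i in range(d):
            for j in range(d):
                m2[i][j] += delta[i] * (x[j] - mean[j])
    return samples, accepted / n_iter


def log_target(v, rho=0.9):
    """Toy target: log density (up to a constant) of a correlated bivariate normal."""
    x, y = v
    return -0.5 * (x * x - 2.0 * rho * x * y + y * y) / (1.0 - rho * rho)


samples, rate = adaptive_mvn_mcmc(log_target, [0.0, 0.0], 5000)
print("acceptance rate:", rate)
```

Because every proposal changes all parameters at once, each iteration would require every partition likelihood to be re-evaluated, which is exactly the situation in which multiple cores can be kept busy.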

Baele et al. [5] show that such an AVMVN transition kernel 
tremendously increases estimation performance over a standard set 
of single-parameter transition kernels. Importantly, the use of an 
AVMVN transition kernel requires a paradigm shift in assessing 
performance of transition kernels in MCMC. It is common to 
judge the performance of Bayesian phylogenetic software packages 
strictly by the time they take to evaluate proposed parameter values, 
often expressed in time per number of states or iterations. How- 
ever, comparing transition kernels that only require a single proces- 
sor core to evaluate a proposed value against transition kernels that 
require a collection of processor cores to evaluate a collection of 
proposed values simultaneously is unfair, as the latter will logically 
take up more time as this involves more (computational) work 
(on multiple processor cores). Hence, a fair comparison involves 
calculating the effective sample size (ESS) per time unit, as this 
takes into account differences in execution speed while still report- 
ing a main statistic of interest. 
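
As a rough illustration of ESS per time unit, the sketch below estimates the ESS of a trace from its autocorrelations and divides by the run time. The truncation rule (stop at the first non-positive autocorrelation), the simulated AR(1) traces, and the run times of 60 and 90 seconds are assumptions for illustration; the ESS estimators used in practice by phylogenetic software may differ in detail.

```python
import random


def effective_sample_size(trace):
    """ESS = n / (1 + 2 * sum of positive-lag autocorrelations), truncated at the
    first non-positive autocorrelation estimate."""
    n = len(trace)
    mean = sum(trace) / n
    centered = [v - mean for v in trace]
    var = sum(c * c for c in centered) / n
    if var == 0.0:
        raise ValueError("trace has zero variance")
    rho_sum = 0.0
    for lag in range(1, n):
        rho = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n * var)
        if rho <= 0.0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


def ess_per_second(trace, seconds):
    """Efficiency measure that accounts for both mixing and run time."""
    return effective_sample_size(trace) / seconds


def ar1_trace(phi, n, seed):
    """Simulated trace with lag-1 autocorrelation phi (stand-in for MCMC output)."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


# Hypothetical comparison: a kernel that is faster per iteration but mixes poorly,
# against a kernel that is slower per iteration but mixes well.
slow_mixing = ar1_trace(0.95, 2000, seed=1)
fast_mixing = ar1_trace(0.30, 2000, seed=2)
print("ESS/s, poorly mixing kernel:", ess_per_second(slow_mixing, seconds=60.0))
print("ESS/s, well mixing kernel:  ", ess_per_second(fast_mixing, seconds=90.0))
```

In this toy setting the second kernel takes longer for the same number of iterations, yet it can still be the more efficient one once the comparison is made in ESS per second.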

We note that the approach of Baele et al. [5] has been shown to 
yield performance increases on CPU, but that it still needs to be 
tested on GPU. This is due to the specific design of the BEAGLE 
library [3], which evaluates a collection of BEAGLE likelihoods / 
instances sequentially on GPU. Current work is underway to sim- 
plify the run time process of the BEAGLE library on GPU, allowing 
for simultaneous evaluation of such a collection of instances. 

6 Conclusions


In this chapter, we have focused on the computational challenges 
associated with typical analyses in the fields of phylogenetics, phy- 
logenomics, and phylodynamics. We have provided a detailed 
description of how the BEAGLE library can employ multicore 
hardware to perform efficient likelihood evaluations and have 
focused on its interaction with the BEAST software package. 
Using benchmarks collected on a range of multicore hardware, 
both from the CPU and GPU market, we have shown that employing
the BEAGLE high-performance computational library can considerably
decrease computation time on these different systems,
for data sets with different characteristics in terms of size and
complexity. The BEAGLE library makes it possible to compute
the likelihoods for different data partitions simultaneously on different CPU
cores or even on different hardware devices, such as multiple GPU 
cards. In addition, existing data partitions can be split into multiple 
subpartitions to be computed in parallel across multicore hardware, 
yielding potentially drastic performance increases as shown in the 
benchmarks discussed in this chapter. 
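
The idea of splitting one partition into subpartitions of equal complexity can be sketched as follows. This is an illustration only: the helper names, the round-robin assignment, and the toy alignment are assumptions, and the way BEAST and BEAGLE divide site patterns internally may differ.

```python
from collections import Counter


def site_patterns(alignment):
    """Collapse alignment columns into unique site patterns with their counts."""
    columns = zip(*alignment)  # each column holds one character per taxon
    return Counter("".join(col) for col in columns)


def split_patterns(patterns, n_parts):
    """Deal unique patterns round-robin so each part holds a near-equal number."""
    items = sorted(patterns.items())
    return [dict(items[i::n_parts]) for i in range(n_parts)]


# Toy alignment: three taxa, ten sites.
alignment = [
    "ACGTACGTAC",
    "ACGTACGAAC",
    "ACGAACGAAC",
]

patterns = site_patterns(alignment)
parts = split_patterns(patterns, 2)
for index, part in enumerate(parts):
    print("subpartition", index, part)
```

Since the log-likelihood of a partition is a weighted sum over its unique site patterns, the subpartition log-likelihoods can be computed independently and added together afterwards.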

Having employed the BEAGLE library on state-of-the-art mul- 
ticore hardware for a range of commonly used evolutionary models, 
we conclude that the combination of using BEAGLE and running 
analyses on powerful graphics cards aimed at the scientific comput- 
ing market allows for massive performance gains for many challeng- 
ing data sets. Given that sequence data sets keep growing in size and 
are being complemented with associated trait data, we have paid 
particular attention to a popular discrete trait model that parame- 
terizes the transition rates between its states as a GLM, to allow for 
the inclusion of covariates to help explain transitions in the trait 
data. Graphics cards can be particularly useful when dealing with 
such models, as shown in the benchmarks presented, and we have 
hence presented a number of examples from the literature in which 
such a setup was used to perform the analyses. 

Discrete phylogeography approaches (or discrete trait analyses),
such as the ones presented in this chapter, treat the sampling
locations of the sequences as informative data, rather than uninfor- 
mative auxiliary variables [9, 23, 25]. As such, the posterior distri- 
bution of the parameters given the data contains not only the 
likelihood of the sequences given the genealogy and substitution 
model but also the likelihood of the sampling locations given the 
genealogy and migration matrix, calculated by integrating over all 
possible discrete state transition histories using Felsenstein’s pruning
algorithm [16]. What makes this computationally demanding is
the potentially large number (equal to the number of branches in the
phylogeny) of potentially high-dimensional (depending on the
number of sampling locations) matrices, the computation of which can be parallelized
across a large number of computing cores such as those found on
a GPU. 
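
A minimal sketch of the pruning recursion for a single discrete trait is given below. To keep it self-contained it assumes a symmetric model in which every pair of locations exchanges lineages at the same rate, which gives a closed-form transition matrix; the nested-list tree format, the function names, and the uniform root frequencies are likewise assumptions for illustration and do not reflect how BEAST or BEAGLE store or compute these quantities.

```python
import math
from itertools import product


def transition_matrix(k, rate, t):
    """P(t) for a symmetric k-state model with one shared rate between all state pairs."""
    decay = math.exp(-k * rate * t)
    same = 1.0 / k + (k - 1.0) / k * decay
    diff = 1.0 / k - decay / k
    return [[same if i == j else diff for j in range(k)] for i in range(k)]


def partials(node, tip_states, k, rate):
    """Conditional likelihood vector at a node, computed from the tips upwards."""
    if isinstance(node, str):  # tip: the observed location has probability 1
        return [1.0 if s == tip_states[node] else 0.0 for s in range(k)]
    result = [1.0] * k
    for child, branch_length in node:
        child_partials = partials(child, tip_states, k, rate)
        p = transition_matrix(k, rate, branch_length)  # one k x k matrix per branch
        for s in range(k):
            result[s] *= sum(p[s][c] * child_partials[c] for c in range(k))
    return result


def trait_likelihood(tree, tip_states, k, rate):
    """Likelihood of the tip locations, summed over all internal state histories."""
    root = partials(tree, tip_states, k, rate)
    return sum(root) / k  # uniform root frequencies


# Tree ((A:0.1, B:0.2):0.05, C:0.3) as nested lists of (child, branch_length) pairs.
tree = [([("A", 0.1), ("B", 0.2)], 0.05), ("C", 0.3)]
print(trait_likelihood(tree, {"A": 0, "B": 0, "C": 2}, k=3, rate=0.7))

# Sanity check: the likelihoods of all possible tip assignments sum to one.
total = sum(
    trait_likelihood(tree, dict(zip("ABC", states)), k=3, rate=0.7)
    for states in product(range(3), repeat=3)
)
print(total)
```

Every branch contributes a k-by-k matrix and a matrix-vector product, so the work per branch grows quadratically with the number of locations; these per-branch operations are what can be spread across many cores.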

Similarly, structured coalescent approaches also contain the 
likelihood of the sequences given the genealogy and substitution 
model in their posterior distribution of the parameters given the 
data. The use of BEAGLE will yield equal benefits to both 
approaches when it comes to the computation of the likelihood of 
the sequences given the genealogy and substitution model. How- 
ever, rather than the likelihood of the sampling locations, 
structured coalescent approaches require computation of the prob- 
ability density of the genealogy and migration history under the 
structured coalescent given the migration matrix and effective pop- 
ulation sizes. To compute this density, a product of exponentials— 
one for each of the time intervals between successive events (coa- 
lescence, sampling, or migration)—needs to be calculated. If the 
number of demes is sufficiently large, a GPU implementation of the 
probability density of the genealogy and migration history under 
the structured coalescent may be able to compute the contribution 
to this density for each of those time intervals in a highly parallel 
manner. 

Approximations to the structured coalescent include, for exam- 
ple, BASTA [9], which aims to compute the probability density of 
the genealogy under the structured coalescent, integrated over 
migration histories. The computational bottleneck of this approach 
lies with calculating and updating the probability distribution of 
lineages among demes, over all lineages and over all coalescent 
events. This involves computing the matrix exponential of the 
product of each time interval duration with the backwards-in- 
time migration rate matrix, of which the diagonal elements are 
defined such that the rows sum to zero. BEAGLE is equipped 
with a parallel thread block design for computing such finite-time 
transition probabilities, and to construct the finite-time transition 
probabilities in parallel across all lineages, and therefore has the 
potential to provide performance increases for structured coales- 
cent approximations such as BASTA. However, the application 
software that calls upon BEAGLE needs to be implemented to 
rely on BEAGLE’s API in order to achieve the corresponding 
performance increases. 
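
For readers unfamiliar with finite-time transition probabilities, the following sketch builds a migration rate matrix whose rows sum to zero and exponentiates it over a time interval. The scaling-and-squaring Taylor scheme is chosen here purely for readability and is not the routine used by BEAGLE or BASTA; the function names and the two-deme example are assumptions for illustration.

```python
import math


def rate_matrix(off_diagonal):
    """Fill in the diagonal so that every row of the rate matrix sums to zero."""
    n = len(off_diagonal)
    q = [row[:] for row in off_diagonal]
    for i in range(n):
        q[i][i] = -sum(off_diagonal[i][j] for j in range(n) if j != i)
    return q


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, squarings=10, terms=12):
    """exp(q * t) via a truncated Taylor series on a scaled matrix, then repeated squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for order in range(1, terms + 1):
        term = [[v / order for v in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Two demes exchanging lineages at rate 0.5 in both directions, over an interval of length 2.
q = rate_matrix([[0.0, 0.5], [0.5, 0.0]])
p = mat_exp(q, 2.0)
print(p)

# For this symmetric two-deme case the answer is known in closed form.
print(0.5 + 0.5 * math.exp(-2.0 * 0.5 * 2.0))
```

One such matrix is needed for every time interval, and its cost grows quickly with the number of demes, which is why constructing them in parallel is attractive.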

Graphics cards aimed at the scientific computing market have 
traditionally offered roughly three times the single-precision per- 
formance compared to their double-precision performance (Tesla 
K40, K20X and K20). Previous generation cards, such as the Tesla 
K10, offered poor double-precision performance and focused 
solely on single-precision performance (up to 24 times their 
double-precision performance). The latest generation of GPUs, 
specifically the Tesla P100, offers tremendous double-precision 
performance, while single-precision performance is still twice as 
high. We therefore expect a doubling in performance for the
computations described in this chapter if we were to run them in
single precision on GPU, provided that the decrease in accuracy
does not lead to rescaling issues that would slow down the
evaluations.

In theory, single-precision likelihood evaluations will be twice
as fast as double-precision likelihood evaluations on CPU as well,
but with more rescaling issues hampering performance. However,
the influence of switching to single precision is more difficult to
assess for CPUs, as there are a number of other factors to consider.
Single-precision floating-point numbers are half the size of
double-precision floating-point numbers and hence they may fit into a
lower level of cache with a lower latency, which potentially frees
up cache space to cache more (or other) data. Additionally, they
require half the memory bandwidth, which frees up that bandwidth
for other operations to be performed. Nevertheless, the total overall
bandwidth will still be limited compared to that of powerful
graphics cards and this will not suffice to bridge the performance
differences between CPUs and GPUs in phylogenetics.

Finally, we have presented an interesting avenue for further
increasing computational performance on multicore hardware, in
the form of a new adaptive MCMC transition kernel. Traditional
MCMC transition kernels generally update single parameters in a
serial fashion, triggering sequential likelihood evaluations on single
cores. The adaptive transition kernel, however, updates a collection
of continuous parameters simultaneously, triggering multiple likelihood
evaluations in parallel on multiple cores and hence allowing
for potentially large improvements in computational efficiency.
Further research into this area is needed to continuously advance
MCMC kernels and keep computation time manageable for a wide
range of models in Bayesian phylogenetics.


7 Notes


1. We have showcased the potentially impressive performance
gains brought about by using BEAGLE in conjunction with 
powerful graphics cards. However, users sometimes complain 
about the poor performance gains they experience when using 
a GPU for their analyses, which may have to do with their GPU 
being not particularly suited for scientific computing. We urge 
readers to be cautious as to which GPU they invest in, as there 
is an important distinction between graphics cards aimed at the 
gaming market and those aimed at the scientific computing 
market. Computer gaming cards mainly offer tremendous 
single-precision performance, but typically weak double- 
precision performance. We hence advise to invest in GPUs 
aimed at scientific computing, offering increased accuracy and 
performance in double precision. As a rule, computer gaming
cards have a much reduced cost compared to scientific computing
cards, but we advise readers to check the technical specifications
of the card before purchase.


2. While 32-bit operating systems are no longer the norm, such
systems are still being used from time to time, and problems
have been reported in the use and/or installation of BEAGLE
on such systems. We strongly advise to install and run BEAST
together with BEAGLE on a 64-bit operating system and, in
the case of problems, urge users to check that their Java installation
is a proper 64-bit installation as well (i.e., avoid 32-bit or
mixed mode software installations). The BEAGLE website,
hosted at https://github.com/beagle-dev/beagle-lib, contains
installation instructions for Windows, Linux/Unix, and
Mac systems.


3. While powerful GPUs can be purchased and installed in desktop
computers for immediate use, high-performance computing
(HPC) centers or computing clusters can also be equipped
with GPUs. These systems typically run a job scheduler that
allows users to submit BEAST analyses to either CPU or GPU
nodes. In case the requested resources are not immediately
available, the submitted job is placed in a queue until those
resources become available, which may take some time. We
hence strongly advise users (especially those who manually
compose their input files) to first test their BEAST XML files
on a local desktop machine with BEAGLE installed, so as
not to waste precious time in a job queue only to find out that
the BEAST XML cannot be run properly.


8 Exercises


1. An important aspect of getting computations, such as those
discussed in this chapter, up and running is determining which
hardware is available on your computer or server. This can 
easily be checked using BEAST once BEAGLE has been 
installed. To check this when using the BEAST graphical user 
interface (GUI), simply check the box that says “Show list of 
available BEAGLE resources and Quit”; alternatively, when 
using the command-line interface with a BEAST Java Archive
(or JAR) file (which can usually be found in the lib directory
within the BEAST folder), you can simply type:


java -jar beast.jar -beagle_info 


If the path to the BEAGLE library hasn’t been set up
automatically, be sure to add its location to the command by
adding:


java -Djava.library.path=/usr/local/lib ... 


On a typical desktop system equipped with a GPU fit for 
scientific computing, this will yield the following output to 
screen: 


Using BEAGLE library v2.1.2 for accelerated, parallel
likelihood evaluation. 2009-2013, BEAGLE Working Group.
Citation: Ayres et al (2012) Systematic Biology 61: 170-173


BEAGLE resources available:
0 : CPU
    Flags: PRECISION_SINGLE PRECISION_DOUBLE

1 : Intel(R) HD Graphics 530
    Global memory (MB): 1536
    Clock speed (Ghz): 1.05
    Number of compute units: 24
    Flags: PRECISION_SINGLE COMPUTATION_SYNCH ...

2 : Tesla K40c
    Global memory (MB): 11520
    Clock speed (Ghz): 0.74
    Number of cores: 2880
    Flags: PRECISION_SINGLE PRECISION_DOUBLE

In order to determine which resource to use for your 
computations, it’s important to look into the specifications of 
the GPU as listed by the hardware vendor. For example, certain 
GPUs will be equipped with a large number of cores and yet 
they’re aimed at the computer gaming market, which will result 
in poor double-precision performance. As we have shown in 
Table 1, the Tesla brand is typically well suited for GPU com- 
puting, but other cards may be appropriate as well if they 
deliver adequate double-precision peak performance. In the 
output printed above, it’s quite obvious that we’ll be interested
in running our analyses on resource 2, i.e., a GPU equipped 
with thousands of computing cores (resource 1 is an integrated 
graphics unit, mainly fit for delivering graphics output to 
screen). 


2. Once you have located a GPU fit for scientific computing on
your desktop computer or server, try to perform your analysis
both on the system’s CPU and GPU to compare performance. 
Using BEAST’s GUI, the default option is to run on CPU; if 
you'd like to run your analysis on a suitable GPU, use BEAST’s 
GUI to select “GPU” where it says “Prefer use of:.” However, 
most desktop computers don’t come equipped with powerful 
graphics cards, and most often servers aimed at high-performance
computing (HPC) will be used for performing
these types of computations. As such servers are typically 
instructed using a command-line interface, BEAST offers the 
possibility to assign computations to one or more specific 
GPUs. Using the system described here, it would hence make 
sense to run your analysis on resource 2, which can be done as 
follows: 


java -jar beast.jar -beagle_gpu -beagle_order 2 data.xml 


Note that not specifying -beagle_order will result in the 
analysis being run on the system’s CPU, i.e., resource 0. Addi- 
tionally, when employing a GPU for your analyses, adding the 
-beagle_gpu argument is highly advised. Many different 
combinations of using resources arise when your data set is 
partitioned into multiple subsets, for example if your data is 
partitioned according to gene and/or codon position. In such 
cases, it may be beneficial to split those partitions onto multiple 
resources by using the -beagle_order command-line 
option. For example, the Ebola virus data set (without trait 
data) has four partitions; it may be useful (although this 
depends on the actual hardware and needs to be tested) to 
compute the likelihood of one partition on the CPU (i.e.,
resource 0) and the other three likelihoods on the GPU (i.e., 
resource 2). This can be done as follows: 


java -jar beast.jar -beagle_gpu -beagle_order 0,2,2,2 ebola.xml


3. In some cases, such as for example the carnivores data set 
analyzed in this chapter, only one (sequence) data partition is 
available. On CPU, drastic performance improvements can still
be achieved by using a BEAGLE feature that makes it possible to split up a
data partition into multiple subsets, as can be seen in Fig. 5.
This approach will lead to performance increases on most CPU
systems, as many laptops now come equipped with 4-core
processors; this can hence easily be tested on the system
you’re currently using. To split a (sequence) data partition
into two subsets, you can use the following command:


java -jar beast.jar -beagle_instances 2 carnivores.xml 


To generate the results in Fig. 5, we have used this 
approach to split the data set into 2, 4, 8, 12, 16, 20, and 
24 subpartitions, increasing performance every step of the way. 
Note that on GPU, this approach will only lead to an increased 

Abstract


Biological, clinical, and pharmacological research now often involves analyses of genomes, transcriptomes, 
proteomes, and interactomes, within and between individuals and across species. Due to large volumes, the 
analysis and integration of data generated by such high-throughput technologies have become computa- 
tionally intensive, and analysis can no longer happen on a typical desktop computer. 

In this chapter we show how to describe and execute the same analysis using a number of workflow 
systems and how these follow different approaches to tackle execution and reproducibility issues. We show 
how any researcher can create a reusable and reproducible bioinformatics pipeline that can be deployed and 
run anywhere. We show how to create a scalable, reusable, and shareable workflow using four different
workflow engines, each of which can be run in parallel: the Common Workflow Language (CWL), Guix Workflow
Language (GWL), Snakemake, and Nextflow.

We show how to bundle a number of tools used in evolutionary biology by using Debian, GNU Guix, 
and Bioconda software distributions, along with the use of container systems, such as Docker, GNU Guix, 
and Singularity. Together these distributions represent the overall majority of software packages relevant for 
biology, including PAML, Muscle, MAFFT, MrBayes, and BLAST. By bundling software in lightweight 
containers, they can be deployed on a desktop, in the cloud, and, increasingly, on compute clusters. 

By bundling software through these public software distributions, and by creating reproducible and
shareable pipelines using these workflow engines, bioinformaticians not only spend less time
reinventing the wheel but also get closer to the ideal of making science reproducible. The examples in
this chapter allow a quick comparison of different solutions.


Key words Bioinformatics, Evolutionary biology, Big data, Parallelization, MPI, Cloud computing, 
Cluster computing, Virtual machine, MrBayes, Debian Linux, GNU Guix, Bioconda, CWL, Common 
Workflow Language, Guix Workflow Language, Snakemake, Nextflow 

1 Introduction


1.1 Overview 


In this chapter, we show how to create a bioinformatics pipeline 
using four workflow systems: CWL, GWL, Snakemake, and 
Nextflow. We show how the pieces fit together, so that you can 
adapt the pipeline for your own purposes, and we discuss the 
different approaches along the way. All scripts and source code 
can be found on GitHub. The online material allows a direct 
comparison of how such workflows are assembled and of their syntax. 

Due to large volumes, the analysis and integration of data 
generated by high-throughput technologies have become compu- 
tationally intensive, and analysis can no longer happen on a typical 
desktop computer. Researchers therefore are faced with the need to 
scale analyses efficiently by using high-performance compute 
clusters or cloud platforms. At the same time, they have to make 
sure that these analyses run in a reproducible manner. And in a 
clinical setting, time becomes an additional constraint, with moti- 
vation to generate actionable results within hours. 

In the case of evolutionary genomics, lengthy computations are 
often multidimensional. Examples of such expensive calculations 
are Bayesian analyses, inference based on hidden Markov models, 
and maximum likelihood analysis, implemented, for example, by 
MrBayes [1], HMMER [2], and phylogenetic analysis by maximum 
likelihood (PAML) [3]. Genome-sized data, or Big Data [4, 5], 
such as produced by high-throughput sequencers, as well as grow- 
ing sample size, such as from UK Biobank, the Million Veterans 
Program, and the other large genome-phenome projects, are 
exacerbating the computational challenges, e.g., [6]. 

In addition to being computationally expensive, many imple- 
mentations of major algorithms and tools in bioinformatics do not 
scale well. One example of legacy software requiring lengthy 
computation is Ziheng Yang’s CodeML, part of the PAML package 
[3]. PAML finds amino acid sites that show evidence of positive 
selection using dN/dS ratios, i.e., the ratio of nonsynonymous to 
synonymous substitution rates. For further discussion see also 
Chapter 12. Executing PAML over an alignment of 100 sequences 
may take hours, sometimes days, even on a fast computer. PAML 
(version 4.x) is designed as a single-threaded process and can only 
exploit one central processing unit (CPU) to complete a calcula- 
tion. To test hundreds of alignments, e.g., different gene families, 
PAML is invoked hundreds of times in a serial fashion, possibly 
taking days on a single computer. Here, we use PAML as an 
example, but the idea holds for any software program that is CPU 
bound, i.e., the CPU speed determines program execution time. A 
CPU bound program will be at (close to) 100% CPU usage. Many 
legacy programs are CPU bound and do not scale by themselves. 


Most bioinformatics (legacy) programs today do not make effective 
use of multi-core computers 


The reason most bioinformatics software today does not make 
full use of multicore computers or GPUs is that writing such 
software is difficult (see Box 1 for a further treatment of this 
topic). 

A common parallelization strategy in bioinformatics is to start 
with an existing nonparallel application and to divide the data 
into independent units of work, or jobs, which run in parallel and do 
not communicate with each other. This is also known as an 
“embarrassingly parallel” solution, and we will pursue this below. 


1.2 Parallelization in 
the Cloud 


Cloud computing allows the use of “on-demand” CPUs accessible 
via the Internet and is playing an increasingly important role in 
bioinformatics. Bioinformaticians and system administrators previ- 
ously had to physically install and maintain large compute clusters 
to scale up computations, but now cloud computing makes it 
possible to rent and access CPUs, GPUs, and storage, thereby 
enabling a more flexible concept of on-demand computing 
[7]. The cloud scales and commoditizes cluster infrastructure and 
management and, in addition, allows users to run their own 
operating system, which is usually not possible with existing cluster 
and GRID infrastructure (a GRID is a heterogeneous network of 
computers that act together). A so-called hypervisor sits between the host 
operating system and the guest operating system, and it makes 
sure they are clearly separated while virtualizing host hardware. 
This means many guests can share the same machine that appears 
to the users as a single machine on the network. This allows 
providers to efficiently allocate resources. Containers are another 
form of light virtualization that is now supported by all the main 
cloud providers, such as Google, Microsoft, Rackspace OpenStack, 
and Amazon (AWS). Note that only OpenStack is available as free 
and open-source software. 

An interesting development is that of portable batch systems 
(PBS) in the cloud. PBS-like systems are ubiquitous in high- 
performance computing (HPC). Both Amazon EC2 and Microsoft 
Cloud offer batch computing services with powerful configuration 
options to run thousands of jobs in the cloud while transparently 
automating the creation and management of virtual machines and 
containers for the user. As an alternative, Arvados is an open-source 
product specifically aimed at bioinformatics applications that makes 
the cloud behave as if it is a local cluster of computers, e.g., [8]. 

At an even higher level, MapReduce is a framework for 
distributed processing of huge datasets, and it is well suited for 
problems using a large number of computers [9]. The map step takes 
a dataset and splits it into parts and distributes them to worker 
nodes. Worker nodes can further split and distribute data. At the 
reduce step, data is combined into a result, i.e., it is an evolved 
scatter and gather approach. An API is provided that allows 
programmers to access functionality. The Apache Hadoop project 
includes a MapReduce implementation and a distributed file system 
[10] that can be used with multiple cloud providers and also on 
private computer clusters. Another similar example is the Apache 
Spark project, based on resilient distributed datasets (RDD), i.e., 
fault-tolerant collections of elements that can be accessed and 
operated on in parallel. 

The advantage of such higher-level systems is that they go well 
beyond hardware virtualization: not only the hardware 
infrastructure but also the operating system, the job scheduler, and 
resource orchestration are abstracted away. This simplifies data 
processing, parallelization, and the deployment of virtual servers 
and/or containers. The downside is that users have less control over 
the full software stack and often need to program against and 
interact with an application programming interface (API). 

Overall, in the last decade, both commercial and noncommercial 
software providers have made cloud computing possible. 
Bioinformaticians can exploit these services. 


1.3 A Pipeline for the 
Cloud 


To create a bioinformatics pipeline, it is possible to combine remote 
cloud instances with a local setup. Prepare virtual machines or 
containers using similar technologies on a local network, such as a 
few office computers or servers, and then use these for calculations 
in the cloud when an analysis takes too long. The cloud computing 
resources may, for instance, support a service at peak usage, while 
regular loads are met with local infrastructure (i.e., burst compute). 
New ideas can be developed and pre-evaluated using modest 
in-house setups and then scaled to match the most demanding 
work. 


Cloud services can be used for burst computing, enabling local 
clusters to be much smaller, as small as a single computer 


In the following sections, we will provide instructions to deploy 
applications, and we will show how the use of workflow systems and 
reproducible environments can greatly simplify running scalable 
workflows on different environments, including the cloud. 


1.4 Parallelization of 
Applications Using a 
Workflow 


In case of embarrassingly parallel applications, programs are run 
independently as separate processes which do not communicate 
with each other. This is also a scatter and gather approach, i.e., 
inputs split into several jobs are fed into each process by the user. 
Job outputs are collected and collated. In bioinformatics, such tasks 
are often combined into computational pipelines. With the PAML 
example, each single job can be based on one alignment, potentially 
giving linear speed improvements by distributing jobs across multi- 
ple CPUs and computers. In other words, the PAML software, by 
itself, does not allow calculations in parallel, but it is possible to 
parallelize multiple runs of PAML by splitting the dataset. The 
downside of this approach is the deployment and configuration of 
pipeline software, as well as the management and complexity of 
splitting inputs and the collecting and collating of outputs. Also, 
pipelines are potentially fragile, because there is no real interprocess 
communication. For example, it is hard to predict the conse- 
quences of a storage or network error in the middle of a week- or 
month-long calculation. 
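
The scatter and gather pattern described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the chapter's pipeline: `run_job` is a hypothetical stand-in for one independent job (such as one PAML run on one alignment), and the input names are invented. Threads are used because, in a real pipeline, each job would mostly wait on an external program.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(alignment):
    """Hypothetical stand-in for one independent job.

    A real pipeline would launch an external program here (for example
    with subprocess.run) and parse its output. This placeholder only
    returns the input name together with its length.
    """
    return alignment, len(alignment)


def scatter_gather(alignments, workers=4):
    """Run one job per input and collate the results."""
    # Scatter: every input becomes its own job; jobs never talk to each other.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_job, alignments))
    # Gather: collate the per-job outputs into a single dictionary.
    return dict(results)
```

Calling `scatter_gather(["cluster000001", "cluster000002"])` returns one entry per input. Because the jobs share nothing, the same structure carries over to many machines; what the sketch leaves out is exactly the fragile part discussed above, namely handling a job that fails halfway.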

Even for multithreaded applications that make use of multiple 
CPUs, such as BLAST and MrBayes, it is possible to scale up 
calculations by using a workflow. For example, MrBayes-MPI ver- 
sion 3.1.2 does not provide between-machine parallelization and is 
therefore machine bound, i.e., the machine’s performance deter- 
mines the total run time. Still, if one needs to calculate thousands of 
phylogenetic trees, discrete jobs can be distributed across multiple 
machines. A similar approach is often used for large-scale BLAST 
analyses over hundreds of thousands of sequences. 

A pipeline typically consists of linear components, where one 
software tool feeds into another, combined with a scattering of jobs 
across nodes and a gathering and collation of results. 

In existing compute clusters, to distribute work across nodes, 
portable batch system (PBS) schedulers are used, such as Slurm 
[11]. Many pipelines in bioinformatics are created in the form of 
Bash, Perl, or Python scripts that submit jobs to these schedulers. 
Such scripted pipelines have the advantage that they are easy to 
write and adaptable to different needs. The downside is that they 
are hard to maintain and not very portable, since the description of 
the environment and the software packages are not part of these 
scripts, reducing or completely preventing the reproducibility of a 
certain analysis in a different context. This has led to the current 
state of affairs in bioinformatics that it is surprisingly hard to share 
pipelines and workflows. As a result much effort is spent reinvent- 
ing the wheel. 
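
To make the portability problem concrete, here is a minimal sketch of the kind of script described above, assuming a Slurm cluster. It only builds the `sbatch` command line for one gene family and does not submit anything. The directory layout and the helper name `sbatch_command` are hypothetical; `--job-name`, `--cpus-per-task`, and `--wrap` are standard `sbatch` options.

```python
import shlex


def sbatch_command(cluster_dir, ctl_file="../paml0-3.ctl"):
    """Build (but do not run) an sbatch call for one gene family."""
    # The job itself is a plain shell command executed on a compute node.
    job = f"cd {shlex.quote(cluster_dir)} && codeml {shlex.quote(ctl_file)}"
    # Use the last path component as a readable job name.
    name = cluster_dir.rsplit("/", 1)[-1]
    return [
        "sbatch",
        f"--job-name={name}",
        "--cpus-per-task=1",  # codeml is single-threaded
        f"--wrap={job}",
    ]
```

Note what is absent: nothing in this script records which version of `codeml` is installed on the compute nodes, or how it got there. That missing description of the environment is what makes such scripts hard to share.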


Most existing bioinformatics pipelines cannot easily be shared and 
reproduced 


In recent years, a number of efforts have started to address the 
problem of sharing workflows and making analyses reproducible. 
One example is the Common Workflow Language (CWL), a speci- 
fication for describing analysis workflows and tools in a way that 
makes them portable and scalable across a variety of environ- 
ments—from workstations to cluster, cloud, and HPC environ- 
ments. CWL is a large bioinformatics community effort. Different 
platforms support CWL, including Arvados, Galaxy, and Seven 
Bridges [8]. 
A second workflow language is the Guix Workflow Language 
(GWL) built on top of the GNU Guix software deployment system. 
GWL aims to provide a deterministic and bit-reproducible analysis 
environment. 

A third workflow language and orchestrator, Nextflow, allows 
scalable and reproducible scientific workflows to run seamlessly 
across multiple platforms from local computers to HPC clusters 
and the cloud, offering a concise and expressive DSL to describe 
complex workflows. Nextflow is routinely used in organizations 
and institutes, such as Roche Sequencing, the Wellcome Trust 
Sanger Institute, and the Center for Genomic Regulation (CRG). 

Fourth, there is Snakemake, another widely used workflow 
management system, written in Python and inspired by GNU Make. It 
allows for the composition of workflows based on a graph of rules 
whose execution is triggered by the presence, absence, or modifica- 
tion of expected files and directories. 
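
The file-driven triggering that Snakemake inherits from GNU Make can be illustrated with a small Python function. This is a simplified sketch of the general idea, assuming only file presence and modification times matter; it is not Snakemake's actual implementation, and the name `needs_run` is hypothetical.

```python
import os


def needs_run(outputs, inputs):
    """Return True if a rule should fire, in the style of Make."""
    # A missing output always triggers the rule.
    if any(not os.path.exists(path) for path in outputs):
        return True
    # Otherwise re-run only if some input is newer than the oldest output.
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return any(os.path.getmtime(path) > oldest_output for path in inputs)
```

A workflow engine applies a check of this kind to every rule in the graph, so that after an interruption only the steps whose outputs are missing or stale are executed again.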

It is interesting to note that all these workflow languages and 
systems originated in bioinformatics. It suggests that in this rapidly 
growing field, the increasing computational needs and moreover 
the diverse demands made more formal solutions a necessity. It also 
suggests that existing workflow engines used in astronomy and 
physics, for example, have different requirements. 


Box 1: Understanding Parallelization 

Parallel computing is related to concurrent computing. In 
parallelized computing, a computational task is typically 
broken down into several, often many, very similar subtasks that can 
be processed independently and whose results are combined 
afterward, upon completion, i.e., a simple scatter and gather 
approach. In contrast, in distributed computing, the various 
processes often do not address related tasks; or when they do, 
the separate tasks may have a varied nature and often require 
some interprocess communication during execution. The lat- 
ter is also a hallmark of supercomputing where compute 
nodes have high-speed connections. 

In the bioinformatics space, we usually discuss embarrass- 
ingly parallel computing which means similar tasks are 
distributed across multiple CPUs without interprocess com- 
munication. This can be among multiple cores within a single 
processor, a multiprocessor system, or a network of compu- 
ters, a so-called compute cluster. 

Even so, parallel multicore programming easily becomes 
complex. Typically, parallel programming has to deal with 
extra data and control flow; it has to deal with deadlocks, 
where dependent tasks wait for each other forever, and with 
race conditions, where tasks try to modify a shared resource 
(e.g., a file) at the same time resulting in a loss of data or an 
undetermined condition. This introduces additional com- 
plexity in software development, bug hunting, and code 
maintenance. Typically it takes more time to debug such 
code than to write it. 
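
As a small illustration of the race condition described in Box 1, the following Python sketch lets several threads update one shared counter. It is a minimal example under stated assumptions (the function name and parameters are invented): the lock turns each read-modify-write into an atomic step, and without it updates could be lost.

```python
import threading


def count_with_lock(n_threads=4, n_increments=10000):
    """Increment one shared counter from several threads, safely."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(n_increments):
            # Read-modify-write on shared state; the lock makes it atomic.
            with lock:
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total
```

Every shared resource needs this kind of care, and forgetting it once produces bugs that appear only occasionally. That is one reason the embarrassingly parallel approach, where jobs share nothing, is so attractive in practice.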


Writing programs that fully utilize multi-core architectures is hard 


Not only is parallel programming intrinsically complicated; 
programmers also have to deal with communication overheads 
between parallel threads. MrBayes, for example, a program for 
calculating phylogenetic trees based on Bayesian analysis, comes 
with MPI support. MPI is a message-based abstraction of paralle- 
lization, in the form of a binary communication protocol imple- 
mented in a C programming library [12]. In some cases the 
parallelized version is slower than the single CPU version. For 
example, the MPI version calculates each Markov chain in parallel, 
and the chains need to be synchronized with each other, in a 
“scatter and gather” pattern. The chains spend time waiting for 
each other in addition to the communication overheads introduced 
by MPI itself. Later MrBayes adopted a hybrid use of coarse- 
grained OpenMPI and fine-grained use of pthreads or OpenMP 
leading to improved scalability, e.g., [13]. 

Another example of communication overhead is with the sta- 
tistical programming language R [14], which does not have native 
threading support built into the language. One possible option is to 
use an MPI-based library which only allows coarse-grained paralle- 
lization from R, as each parallelized R thread starts up an R 
instance, potentially introducing large overheads, both in commu- 
nication time and memory footprint. For a parallelized program to 
be faster than its single-threaded counterpart, these communica- 
tion overheads have to be dealt with. 


Parallelization in R is coarse-grained with large overhead 


The need for scaling up calculations on multi-CPU computers 
has increased the interest in a number of functional programming 
languages, such as Erlang [15], Haskell [16], Scala [17], and Julia 
[18]. These languages promise to ease writing parallel software by 
introducing a higher level of abstraction of parallelization, com- 
bined with immutable data, automatic garbage collection, and 
good debugging support [5, 19]. For example, Erlang and Scala 
rely on Actors as an abstraction of parallelization and make 
reasoning about fine-grained parallelization easier and therefore 
less error prone. 

Actors were introduced and explored by Erlang, a computer 
language originally designed for highly parallelized 
telecommunications computing. To the human programmer, each Actor 
appears as a linear piece of programming and is parallelized without the 
complexity of locks, mutexes, and semaphores. Actors allow for 
parallelization in a manageable way, where lightweight threads are 
guaranteed to be independent and each has a message queue, 
similar to MPI. Actors, however, are much faster, more intuitive, 
and, therefore, probably, safer than MPI. Immutable data, when 
used on a single multi-CPU computer, allows fast passing of data by 
reference between Actors. When a computer language supports the 
concept of immutability, it guarantees data is not changed between 
parallel threads, again making programming less error prone and 
easier to structure. Actors with support for immutable data are 
implemented as an integral part of the programming language in 
Erlang, Haskell, Scala, Elixir, and D [20]. 

Another abstraction of parallelized programming is the 
introduction of goroutines, part of the Go programming language 
[21]. Where MPI and Actors are related to a concept of message 
passing and mailboxes, goroutines are more closely related to Unix 
named pipes. Goroutines also aim to make reasoning about 
parallelization easier, by providing a pipe where data goes in and results 
come out, and this processing happens concurrently without use of 
mutexes, making it easier to reason about linear code. Goroutines 
are related to communicating sequential processes (CSP), described 
in the original paper by Tony Hoare in 1978 [22]. Meanwhile, recent 
practical implementations are driven by the ubiquity of cheap multicore 
computers and the need for scaling up. A Java implementation of 
CSP exists, named JCSP [23], and a Scala alternative named CSO 
[24]. Go made goroutines intuitive and a central part of the 
strongly typed compiled language. 


Erlang, Elixir, Haskell, Scala, Julia, Go and D are languages offering 
useful abstractions and tools for multi-core programming 


It is important to note that the problems, ideas, and concepts of 
parallel programming are not recent. They have been an important 
part of computer science theory for decades. We invite the reader 
interested in parallel programming to read up on the languages that 
have solid built-in support for high-level parallelization abstractions, 
in particular, Scala [17], Go [21], and D [20]. 


GPU Programming 


Another recent development is the introduction of GPU comput- 
ing or “heterogeneous computing” for offloading computations. 
Most GPUs consist of an array of thousands of cores that can 
execute similar instructions at the same time. Having a few 
thousand GPU cores can speed up processing significantly. 
Programming GPUs, however, is a speciality requiring specialized 
compilers and communication protocols, and there are many 
considerations, not least the I/O bottleneck between the main 
memory and the GPU’s dedicated RAM [5]. Even so, it is interesting to 
explore the use of GPUs in bioinformatics since they come with 
almost every computer today and clusters of GPUs can increasingly 
be found in HPC infrastructure and in the cloud alike. With the 
advent of “deep neural networks” and the general adoption of 
machine learning techniques for Big Data, GPUs have become a 
mainstream technology in data mining. 


2 Package Software in a Container 


Container technologies, such as Docker and Singularity, have 
gained popularity because they have less overhead than full virtual 
machines (VMs) and are smaller in size [24]. Containers are fully 
supported by the major cloud computing providers and play an 
important role for portability across different platforms. 

Adoption of container solutions on HPC has been problematic, 
mostly because of security concerns. Singularity [26] offers a 
decentralized environment encapsulation that works in user space 
and that can be deployed in a simpler way since no root privileges 
are required to execute tools provided with Singularity. That is, 
Singularity containers can be created on a system with root privi- 
leges but run on a system without root privileges—though it 
requires some special kernel support. Docker containers can be 
imported directly in Singularity, so when we present how to build 
Docker container images in the following sections, the reader 
should be aware that the same images can also be used with Singu- 
larity. Singularity is slowly being introduced in HPC setups [27]. 

GNU Guix also has support for creating and running Linux 
containers. One interesting benefit is that, because the software 
packaging system is read-only and provides perfect isolation, con- 
tainers automatically can share specific software running on the 
underlying system, making running containers even lighter and 
extremely fast. 

In this section we discuss three popular software distribution 
systems for Linux: Debian GNU/Linux (Debian), GNU Guix, and 
Conda. They can be used together on a single system, allowing 
access to most bioinformatics software packages in use today. We 
bundle tools in a Docker image, which can run on a single 
multicore desktop computer, on a compute cluster, and in the cloud. 
2.1 Debian Med 


Debian (http://www.debian.org) is the oldest software distribution 
(started 1993) mentioned here with the largest body of software 
packages. Debian targets a wide range of architectures and includes 
a kernel plus a large body of other user software including graphical 
desktop environments, server software, and specialist software for 
scientific data processing. Overall Debian represents millions of 
users and targets most platforms in use today, even though it is 
not the only packaging system around (RPM being a notable 
alternative, for RedHat, Fedora, OpenSuSE, and CentOS). 

Debian Med is a project within Debian that packages software 
for medical practice and biomedical research. The goal of Debian 
Med is a complete open system for all tasks in medical care and 
research [28]. With Debian Med over 400 precompiled 
bioinformatics software programs are available for Linux, as well as some 
400 R packages. Proper free and open-source software (FOSS) can 
easily be packaged and distributed through Debian. Debian and its 
derivatives, such as Ubuntu and Mint, share the deb package format 
and have a long history of community support for bioinformatics 
packages [28, 29]. 


2.1.1 Create a Docker 
Image with Debian 


Using the bio packages already present in Debian, it is 
straightforward to build a Docker container that includes all the necessary 
software to run the example workflows. Here is the code for 
creating the Docker image (see also [30]). We created a pre-built Docker 
image which is available on Docker Hub [31]. 

Essentially, write a Dockerfile: 


FROM debian:buster 
RUN apt-get update && apt-get -y install perl clustalo paml 
ADD pal2nal.pl /usr/local/bin/pal2nal.pl 
RUN chmod +x /usr/local/bin/pal2nal.pl 


And build the container image: 


docker build -t scalability_debian -f Dockerfile.debian . 


2.2 GNU Guix 


GNU Guix (https://www.gnu.org/software/guix/) is a package 
manager of the GNU project that can be installed on top of other 
Linux distributions and represents a rigorous approach toward 
dependency management [32]. GNU Guix software packages are 
uniquely isolated by a hash value computed over all inputs, includ- 
ing the source package, the configuration, and all dependencies. 
This means that it is possible to have multiple versions of the same 
software and even different variants or combinations of software, 
e.g., Apache web server with SSL and without SSL compiled on a 
single system. 
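
The idea of isolating packages by a hash over all inputs can be sketched as follows. This is a simplified illustration only and not Guix's actual hashing scheme: the function `store_name`, its parameters, and the directory naming are assumptions chosen to show why two variants of the same software end up in different locations.

```python
import hashlib


def store_name(name, version, source_digest, configure_flags, dependencies):
    """Derive a directory name from everything that went into a build."""
    digest = hashlib.sha256()
    items = (name, version, source_digest,
             *sorted(configure_flags), *sorted(dependencies))
    for item in items:
        digest.update(item.encode("utf-8"))
        digest.update(b"\0")  # separator, so ("ab", "c") differs from ("a", "bc")
    # Any change to any input yields a different prefix, hence a different path.
    return f"{digest.hexdigest()[:32]}-{name}-{version}"
```

With this scheme a web server built with an SSL flag and one built without it receive different names and can be installed side by side, while rebuilding from identical inputs always gives the same name.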


As of November 2017, GNU Guix provides over 6500 software 
packages, including a wide range of dedicated scientific software 
for bioinformatics, statistics, and machine learning 


2.2.1 Create a Docker 
Image with GNU Guix 


GNU Guix has native support for creating Docker images. Creating 
a Docker image with GNU Guix is a one-liner: 


guix pack -f docker -S /bin=bin paml clustal-omega 

which creates a reproducible Docker image containing PAML and 
Clustal Omega [33], including all of their runtime dependencies. 
Guix makes it very easy to write new package definitions using the 
Guile language (a LISP). If you want to include the definition of 
your own packages (that are not in Guix main line), you can include 
them dynamically. This is how we add pal2nal [34] in the GWL 
workflow example below (see Subheading 3.3). 


2.3 Conda 


Conda (https://conda.io/docs/) is a cross-platform package 
manager written in Python that can be used to install software written in 
any language. Conda allows the creation of separate environments 
to deploy multiple or conflicting package versions, offering a 
means of isolation. Note that this isolation is not as rigorous as 
that provided by GNU Guix or containers. The Bioconda [35] 
(https://bioconda.github.io/) project provides immediate access 
to over 2900 software packages for bioinformatics, and it is 
maintained by an active community of more than 200 contributors. 


2.3.1 Create a Docker 
Image with Bioconda 


A Docker container can be created starting from the “Miniconda” 
image template, which is based on Debian. The Docker instruc- 
tions are comparable to those of Debian above: 


FROM conda/miniconda3 
RUN conda config --add channels conda-forge 
RUN conda install -y perl=5.22.0 
RUN conda install -y -c bioconda paml=4.9 clustalo=1.2.4 wget=1.19.1 
ADD pal2nal.pl /usr/local/bin/pal2nal.pl 
RUN chmod +x /usr/local/bin/pal2nal.pl 


Note that we provide the version numbering of the packages. If 
you want to build this container, you can use the Dockerfile 
provided in the GitHub repository [30] and then run: 


docker build -t scalability . 


We also added a pre-built container image on Docker 
Hub [31]. 

Conda can also be used outside any container system to install 
the software directly on a local computer or cluster. To do that, first 
install the Miniconda package (https://conda.io/miniconda.html), 
and then create a separate environment with the necessary 
software to run the workflows. The following is an example to set up a 
working environment: 


conda create -n scalability 
source activate scalability 
conda config --add channels conda-forge 
conda install -y perl=5.22.0 
conda install -y -c bioconda paml=4.9 clustalo=1.2.4 wget=1.19.1 
wget http://www.bork.embl.de/pal2nal/distribution/pal2nal.v14.tar.gz 
tar xzvf pal2nal.v14.tar.gz 
sudo cp pal2nal.v14/pal2nal.pl /usr/local/bin 
sudo chmod +x /usr/local/bin/pal2nal.pl 


Note that we use Miniconda here to bootstrap Bioconda. 
Bioconda can be bootstrapped in other ways. One of them is 
GNU Guix, which contains a Conda package. 


2.4 A Note on 
Software Licenses 


All above packaging systems use free and open-source software 
(FOSS) released under a permissible license, i.e., a license permit- 
ting the use, modification, and distribution of the source code for 
any purpose. This is important because it allows software distribu- 
tions to distribute all included software freely. Software that is made 
available under more restrictive licenses, such as for “academic 
nonprofit use only,” cannot be distributed in this way. An example 
is PAML, which used to have such a license. Only after the license 
was changed could PAML be included in Debian and other 
distributions. Also, for this book chapter, 
we asked the author of pal2nal to add a proper license. After adding 
the GPLv2, it became part of the Debian distribution; see also 
https://tracker.debian.org/pkg/pal2nal. This means that the 
Docker scripts above can be updated to install the pal2nal Debian 
package. 

When you use scientific software, always check the type of 
license under which it is provided, to understand what you can or 
cannot do with it. When you publish software, add a license along 
with your code, so others can use it and distribute it. 

Typical licenses used in bioinformatics are MIT (Expat) and 
BSD, which are considered very permissive, and also GPL and the 
Apache License, which are designed to grant additional protections 
with regard to derivative works and patentability. Whenever possi- 
ble, free software licenses such as mentioned above are encouraged 
for scientific software. Check the guidelines of your employer and 
funding agencies. 
3 Create a Scalable and Reusable Workflow 


3.1 Example 
Workflow 


We have created a number of examples to test a scalable and 
reproducible workflow; the full code and examples are 
available on GitHub [30]. In this case putative gene families of the 
oomycete Phytophthora infestans are tested for evidence of positive 
selection. P. infestans is a single-cell pathogen, which causes late 
blight of potato and tomato. Gene families under positive selection 
pressure may be involved in protein-protein interactions and are 
potentially of interest for fighting late blight disease. 

As an example the P. infestans genome data [36] was fetched 
from 
http://www.broadinstitute.org/annotation/genome/phytophthora_infestans/MultiDownloads.html, 
and predicted genes were grouped by blastclust using 70% identity 
(see also Chapter 21). This resulted in 72 putative gene families 
listed on the online repository on GitHub [30]. 

The example workflow aligns amino acid sequences using Clus- 
tal Omega, creates a neighbor joining tree, and runs CodeML from 
the PAML suite. The following is one example to look for evidence 
of positive selection in a specific group of alignments: 


clustalo -i data/clusterXXXXXX/aa.fa --guidetree-out=data/clusterXXXXXX/aa.ph > data/clusterXXXXXX/aa.aln 
pal2nal.pl -output paml data/clusterXXXXXX/aa.aln data/clusterXXXXXX/nt.fa > data/clusterXXXXXX/alignment.phy 
cd data/clusterXXXXXX 
codeml ../paml0-3.ctl 


First we align the amino acid sequences with Clustal Omega, followed 
by conversion to the corresponding nucleotide alignment with pal2nal. 
Next we test for evidence of positive selection using PAML’s CodeML 
with models M0 to M3. Note that the tools and settings used here are 
merely chosen for educational purposes. The approach itself 
may result in false positives, as explained by Schneider et al. 
[37]. Also, PAML is not the only software that can test for evidence 
of positive selection; for example, the HyPhy molecular evolution 
and statistical sequence analysis software package contains similar 
functionality and uses MPI to parallelize calculations [38]. PAML is 
used here because it is a reference implementation and is suitable as 
an example how a legacy single-threaded bioinformatics application 
can be parallelized in a workflow. 
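
As a rough illustration of that idea, the following Python sketch wraps the three commands in a function and runs one pipeline per gene family concurrently. It is not part of the chapter's repository: the function names, the worker count, and the guide-tree file name aa.ph are assumptions made for this example, and the workflow systems described in the next sections handle the same task more robustly.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_commands(cluster_dir):
    """Return the pipeline steps for one cluster as (argv, stdout_file, cwd) tuples."""
    d = Path(cluster_dir)
    return [
        # Protein alignment; clustalo prints the alignment to stdout.
        (["clustalo", "-i", str(d / "aa.fa"),
          "--guidetree-out=" + str(d / "aa.ph")], d / "aa.aln", None),  # aa.ph: assumed name
        # Codon alignment in PAML format, derived from the protein alignment.
        (["pal2nal.pl", "-output", "paml", str(d / "aa.aln"), str(d / "nt.fa")],
         d / "alignment.phy", None),
        # codeml is single-threaded and runs inside the cluster directory.
        (["codeml", "../paml0-3.ctl"], None, d),
    ]


def run_cluster(cluster_dir):
    """Run the three steps in order; raise CalledProcessError if any step fails."""
    for argv, stdout_file, cwd in build_commands(cluster_dir):
        if stdout_file is None:
            subprocess.run(argv, cwd=cwd, check=True)
        else:
            with open(stdout_file, "w") as out:
                subprocess.run(argv, cwd=cwd, stdout=out, check=True)
    return cluster_dir


def run_all(data_dir="data", workers=4):
    """Process independent gene families concurrently, one pipeline per cluster."""
    clusters = sorted(p for p in Path(data_dir).glob("cluster*") if p.is_dir())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_cluster, clusters))
```

Threads are sufficient here because each worker only waits for an external process; the parallelism comes from running several independent codeml processes at the same time.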

In the next section, different workflow systems are presented that can be used to run the described analysis in a scalable and reproducible manner, locally on a desktop, on a computer cluster, or in the cloud. All the code and data needed to run these examples are available on GitHub [30]. To load the code on your desktop, clone the git repository locally. The examples can be executed from the repository tree:

git clone https://github.com/EvolutionaryGenomics/scalability-reproducibility-chapter.git

3.2 Common Workflow Language

Common Workflow Language (CWL, http://www.commonwl.org/) is a standard for describing workflows that are portable across a variety of computing platforms [39]. CWL is a specification and not software in itself, though it comes with a reference implementation which can be run with Docker containers. CWL promotes an
ecosystem of implementations and supporting systems to execute 
the workflows across multiple platforms. The promise is that when 
you write a workflow for, e.g., Arvados, it should also run on 
another implementation, e.g., Galaxy. 

Given that CWL takes inspiration from previously developed 
tools and GNU Make in particular [40], the order of execution in a 
CWL workflow is based on dependencies between the required 
tasks. However, unlike GNU Make, CWL tasks are defined to be
isolated, and you must be explicit about inputs and outputs. The 
benefits of explicitness and isolation are flexibility, portability, and 
scalability: tools and workflows described with CWL can transpar- 
ently leverage software deployment technologies, such as Docker, 
and can be used with CWL implementations from different ven- 
dors, and the language itself can be applied to describe large-scale 
workflows that run in HPC clusters, or the cloud, where tasks are 
scheduled in parallel across many nodes. 
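
How an execution order can follow from declared inputs and outputs alone is sketched below in Python. This is not CWL and not how any particular CWL engine is implemented; the step dictionary, the file names, and the helper function are assumptions chosen to mirror the example analysis.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Steps are deliberately listed out of order; only inputs/outputs are declared.
steps = {
    "codeml": {"inputs": ["alignment.phy", "aa.ph"], "outputs": ["results.txt"]},
    "pal2nal": {"inputs": ["aa.aln", "nt.fa"], "outputs": ["alignment.phy"]},
    "clustalo": {"inputs": ["aa.fa"], "outputs": ["aa.aln", "aa.ph"]},
}


def execution_order(steps):
    """Derive a valid execution order from the files each step consumes and produces."""
    # Map every produced file to the step that creates it.
    producer = {out: name for name, s in steps.items() for out in s["outputs"]}
    # A step depends on the producers of its inputs; files nobody produces
    # (aa.fa, nt.fa) are workflow inputs and add no dependency.
    graph = {
        name: {producer[i] for i in s["inputs"] if i in producer}
        for name, s in steps.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(steps))  # ['clustalo', 'pal2nal', 'codeml']
```

Because every step names its inputs and outputs explicitly, an engine can also see which steps are independent of each other and schedule those in parallel.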

CWL workflows are written in JSON or YAML formats. A 
workflow consists of blocks of steps, where each step in turn is 
made up of a task description that includes the inputs and outputs 
of the task itself. The order of execution of the tasks is determined 
automatically by the implementation engine. In the GitHub repos- 
itory, we show an example of a CWL workflow to describe the 
analysis over the protein alignments. To test the workflow, you 
will need the CWL reference runner implementation: 


pip install cwlref-runner

and then to run the example from the repository tree:

CWL/workflow.cwl --clusters data

To run the CWL workflow on a grid or cloud multi-node system, we can install another CWL implementation, this one built upon the toil platform [41]:

pip install toil[cwl]

toil-cwl-runner CWL/workflow.cwl --clusters data


CWL comes with extra tooling, such as visualization of CWL workflows (Fig. 1). See view.commonwl.org for more examples.

Fig. 1 Workflow automatically generated from the CWL schema, showing how PAML’s Codeml receives inputs from two sources and produces the output information. A workflow engine figures out that it has to run clustal first, followed by pal2nal and Codeml as a linear sequence. For each input, the job can be executed in parallel

3.3 Guix Workflow Language

The Guix Workflow Language (GWL) extends the functional pack- 
age manager GNU Guix [32] with workflow management capabil- 
ities. GNU Guix provides an embedded domain-specific language 
(EDSL) for packages and package composition. GWL extends this 
EDSL with processes and process composition. 

In GWL, a process describes the computation, for example, 
running the clustalo program. A workflow in the GWL describes 
how processes relate to each other. For example, the Codeml program can only run after both clustalo and pal2nal have finished successfully.

The tight coupling of GWL and GNU Guix ensures that not only the workflow is described rigorously but also the deployment of the programs on which the workflow depends.

To run the GWL example, you need to have GNU Guix (https://www.gnu.org/software/guix/manual/html_node/Binary-Installation.html) and the GWL installed on your computer. Once GNU Guix is available, GWL can be installed using:

guix package -i gwl

The example can be run using:

cd scalability-reproducibility-chapter/GWL

guix workflow -r example-workflow

GWL also implements execution engines to offload computation to compute clusters, allowing it to scale. The process engines can use the package composition capabilities of GNU Guix to create the desired form of software deployment, be it installing programs on the local computer or creating an application bundle, a Docker image, or a virtual machine image.

To run our example on a cluster that has Grid Engine:

guix workflow -r example-workflow -e grid-engine

GNU Guix + GWL can ensure full reproducibility of an analysis, including all software dependencies, all the way down to glibc. GNU Guix computes a unique string, a hash, over the complete set of inputs and the build procedure of a package. It can guarantee that a package is built with the same source code, dependency graph, and build procedure, and produces identical output. In GWL, a hash is computed for each process and workflow over the packages, the procedure, and the execution engine. By comparing hashes it is possible to check whether a workflow is running with exactly the same underlying software packages and procedures; in addition, the full graph of dependencies can be visualized. To obtain such an execution plot:

guix package -i graphviz
guix workflow -g example-workflow | dot -Tpdf > example-workflow.pdf

Note that, unlike the other workflow solutions discussed here, GWL does not use the time stamps of output files. The full dependency graph is set before running the tools, and it only needs to check whether a process returns an error state. This means that there are no issues around time stamps and output files do not have to be visible to the GWL engine.

3.4 Snakemake

Snakemake [42] is a workflow management system that takes inspi- 
ration from GNU Make [40], a tool to coordinate the compilation 
of large programs consisting of interdependent source files 
(https://snakemake.readthedocs.io/en/stable/).

Snakemake provides a DSL that allows the user to specify 
generator rules. A rule describes the steps that need to be per- 
formed to produce one or more output files, such as running a 
shell script. These output files may be used as inputs to other rules. 
The workflow is described as a graph in which the nodes are files (provided input files, generated intermediate files, or the desired output files) and the edges are inferred from the input/output interdependencies of connected rules.

When a user requests a certain file to be generated, Snakemake matches the file name against concrete or wildcard rules, traverses the graph from the target file upward, and begins processing the steps for every rule for which no up-to-date output file is available. Whether or not an output file is considered up to date depends on its time stamp relative to the time stamps of its prerequisite input files. In doing so, Snakemake only performs work that has not yet been done or for which the results are out of date, just like GNU Make. Snakemake can be configured to distribute jobs to batch systems or to run jobs on the local system in parallel. The degree of parallelization depends on the dependencies between rules.

Snakemake is written in Python and allows users to import Python modules and use them in the definition of rules. It also has special support for executing R scripts in rules, by exposing rule parameters (such as inputs, outputs, concrete values for wildcards, etc.) as an S4 object that can be referenced in the R script.

Snakemake provides native support for the Conda package manager. A rule may specify a Conda [35] environment file describing a software environment that should be active when the rule is executed. Snakemake will then invoke Conda to download the required packages as specified in the environment file. Alternatively, Snakemake can interface with an installation of the Singularity container system [26] and execute a rule within the context of a named application bundle, such as a Docker image.

To run the Snakemake workflow, you need to install Snakemake (example shown with Conda):

conda install -y -c bioconda snakemake=4.2.0

And then to run the example from the repository tree:

cd Snakemake

snakemake

3.5 Nextflow

Nextflow [43] is a framework and an orchestration tool that enables
scalable and reproducible scientific workflows using software con- 
tainers (https://www.nextflow.io/). It is written in the Groovy 
JVM programming language [44] and provides a domain-specific 
language (DSL) that simplifies writing and deploying complex 
workflows across different execution platforms in a portable 
manner. 

A Nextflow pipeline is described as a series of processes, where each process can be written in any language that can be executed or interpreted on Unix-like operating systems (e.g., Bash, Perl, Ruby, Python, etc.). A key component of Nextflow is the dataflow programming model, which is a message-based abstraction for parallel programming similar to the CSP paradigm (see [23]). The main difference between CSP and dataflow is that in the former, processes communicate via synchronous messages, while in the latter, the messages are sent in an asynchronous manner. This approach is useful when deploying large distributed workloads because it has latency tolerance and error resilience. In practical terms, the dataflow paradigm uses a push model in which a process in the workflow sends its outputs to the downstream processes, which wait for the data to arrive before starting their computation. The communication between processes is performed through channels, which define inputs and outputs for each process. Branches in the workflow are also possible and can be defined using conditions that specify whether a certain process must be executed, depending on the input data or on user-defined parameters.

The dataflow paradigm is the closest representation of a pipe- 
line idea where, after having opened the valve at the beginning, the 
flow progresses through the pipes. But Nextflow can handle this 
data flow in a parallel and asynchronous manner, so a process can 
operate on multiple inputs and emit multiple outputs at the same 
time. In a simple workflow where, for instance, there are 100 nucle- 
otide sequences to be aligned with the NCBI NT database using 
BLAST, a first process can compute the alignment of the 
100 sequences independently and in parallel, while a second process 
will wait to receive and collect each of the outputs from the 
100 alignments to create a final results file. To allow workflow 
portability, Nextflow supports multiple container technologies 
such as Docker and Singularity and integrates natively with Git 
and popular code sharing platforms, such as GitHub. This makes 
it possible to precisely prototype self-contained computational 
workflows, tracking also all the modifications over time and ensur- 
ing the reproducibility of any former configuration. Nextflow 
allows executing workflows across different computing platforms 
by supporting several cluster schedulers (e.g., SLURM, PBS, LSF 
and SGE) and allowing direct execution on the Amazon cloud 
(AWS), using services, such as AWS Batch or automating the crea- 
tion of a compute cluster in the cloud for the user. 
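
The 100-sequence example can be mimicked in a few lines of Python, with an asyncio queue standing in for a channel. This is only a sketch of the push-style dataflow idea, not Nextflow code: the function names are invented for illustration, the BLAST search is replaced by a placeholder, and asyncio provides concurrency within one process rather than distributed execution.

```python
import asyncio


async def align(seq_id, channel):
    """First process: 'align' one sequence and push the result downstream."""
    await asyncio.sleep(0)  # placeholder for the real BLAST search
    await channel.put((seq_id, f"hit-for-{seq_id}"))


async def collect(channel, expected):
    """Second process: wait for results to arrive, in whatever order they come."""
    results = {}
    for _ in range(expected):
        seq_id, hit = await channel.get()
        results[seq_id] = hit
    return results


async def main(n_sequences=100):
    channel = asyncio.Queue()  # plays the role of a dataflow channel
    collector = asyncio.create_task(collect(channel, n_sequences))
    # All alignments are started independently of each other.
    await asyncio.gather(*(align(i, channel) for i in range(n_sequences)))
    return await collector


if __name__ == "__main__":
    print(len(asyncio.run(main())))  # 100
```

The collector never asks for data; it simply waits on the channel, which is the push model described above.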

To run the Nextflow example, you need to have Java 8 and a Docker engine (1.10 or higher) installed. Next install Nextflow with:

curl -s https://get.nextflow.io | bash

Run the example from the repository tree:

./nextflow run Nextflow/workflow.nf -with-docker evolutionarygenomics/scalability


To save the graph of the executed workflow, it is sufficient to
add the option “-with-dag workflow.pdf”. The same example can
also be run without Docker if the required packages have been 
installed locally following the Bioconda or Guix examples. In this 
case you can omit the “-with-docker” instruction. To run the 
example on a compute cluster or in the cloud, it is sufficient to 
specify a different executor (e.g., sge or awsbatch) in the Nextflow 
configuration file and ensure that those environments are config- 
ured to properly work with the Docker container. 


In this chapter we show how to describe and execute the same 
analysis using a number of workflow systems and how these follow 
different approaches to tackle execution and reproducibility issues. 
It is important to assess the underlying design choices of these solutions and also to look at the examples we provide online. Even though it may look attractive to opt for the simplest choices, the associated maintenance burden may be cause for regret later.

The workflow tools introduced in this chapter offer direct 
integration of software packages. The overall advantage of the 
bundling software approach is that when software deployment 
and execution environment are controlled, the logic of the analysis 
pipeline can be developed separately using descriptive workflows. 
This separation allows communities to build best practice shareable 
pipelines without worrying too much about individual system 
architectures and the underlying environments. An example is the effort by the Global Alliance for Genomics and Health (GA4GH, https://www.ga4gh.org) to develop and share best practice analysis workflows with accompanying container images [45].

In this chapter we also discussed the scaling up of computations 
through parallelization. In bioinformatics, the common paralleliza- 
tion strategy is to take an existing nonparallel application and divide 
data into discrete units of work, or jobs, across multiple CPUs and 
clustered computers. Ideally, running jobs in parallel on a single 
multicore machine shows linear performance increase for every 
CPU added, but in reality it is less than linear [46]. Resource 
contention on the machine, e.g., disk or network I/O, may have 
processes wait for each other. Also, the last, and perhaps longest, 
running job causes total timing to show less than linear perfor- 
mance, as the already finished CPUs are idle. In addition to the 
resource contention on a single machine, the network introduces 
latencies when data is moved around. 
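
The effect of one long-running job on parallel speedup can be made concrete with a small calculation. The Python sketch below is an idealized model under stated assumptions: jobs are handed to whichever CPU becomes free first, run times are known, and I/O contention and network latency are ignored. The function names and job times are invented for illustration.

```python
import heapq


def makespan(job_times, n_cpus):
    """Total wall-clock time when each job goes to the CPU that is free first."""
    cpus = [0.0] * n_cpus  # time at which each CPU becomes free
    heapq.heapify(cpus)
    for t in job_times:
        free_at = heapq.heappop(cpus)
        heapq.heappush(cpus, free_at + t)
    return max(cpus)


def speedup(job_times, n_cpus):
    """Speedup relative to running all jobs one after another on a single CPU."""
    return sum(job_times) / makespan(job_times, n_cpus)


if __name__ == "__main__":
    print(speedup([2] * 8, 4))        # 4.0: equal jobs, linear speedup
    print(speedup([1] * 8 + [8], 4))  # 1.6: one long final job leaves 3 CPUs idle
```

Both job lists contain 16 time units of work, yet the second finishes in 10 units instead of 4 because the last job dominates the total run time.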
Running the example workflow in the cloud has similar performance and scalability compared to running it on a local infrastructure, after adjusting for differences in hardware and network speeds. Cloud computing is an attractive proposition for scaling up calculation jobs and storing data. Cloud prices for virtual servers and data storage have decreased dramatically, and the possibility of using spot or preemptible instances (i.e., virtual servers that can be priced down to 70% or 80% the normal price but that can be shut down at any moment by the cloud provider) is making cloud computing solutions competitive for high-performance and scientific computing. Cloud computing essentially outsources hardware and related plumbing and maintenance. Sophisticated tooling allows any researcher to run software in the cloud. We predict an increasing number of groups and institutes will move from large-scale HPC clusters toward tight HPC cluster solutions that can handle continuous throughput with burst compute in the cloud.

Reproducibility is a prime concern in science. Today several 
solutions are available to address reproducibility concerns. Systems 
such as Docker and Singularity are built around bundling binary 
applications and executing them in a container context. Advanced 
package managers such as Conda or Guix allow the user to create 
separate software environments where different application versions 
can be deployed without collisions while ensuring control and 
traceability over changes and dependencies. All these solutions 
represent a different approach to address the reproducibility chal- 
lenge while also offering a different user experience and requiring 
different setups to work properly. For instance, container-based 
systems such as Docker and Singularity are not always a viable 
option in HPC environments since they may require updates to 
the existing computing infrastructure. Also, HPC operating system 
installations may include kernel versions that do not allow for the 
so-called user namespaces, a fundamental component among the 
many kernel features that together allow an application to run in an 
isolated container environment. Another downside of containers is that it is hard to assess what is in them; they act like a black box. When containers are created with Docker scripts such as those above, their contents depend on the time at which they are assembled. A Debian or Conda update between container builds, for example, may bring in a different software version and therefore a different dependency graph. Only GNU Guix containers provide a clear view of what is contained.

Containers provide isolation from the underlying operating 
system. On HPC environments it may be required to run software 
outside a container. While applications built with Guix or Conda 
can be run in isolation when container support is available, they do 
not require these features at runtime. As a package manager, Conda depends neither on container features nor on root privileges, but it pays for this convenience with a lack of both process isolation and bit-reproducibility [47]. GNU Guix, meanwhile, provides the most rigorous path to reproducible software deployment. In order to guarantee that packages are built in a bit-reproducible fashion and to share binary packages, Guix requires packages to be stored in the directory /gnu/store. There are several workarounds for this requirement; one is to use containers, and another is to mount /gnu/store from a host that has build privileges for that directory. A third option is to build packages targeted at a different directory, but this loses bit-reproducibility and the convenience of binary installs. A fourth option is to provide relocatable binary installation packages that can be installed in a user-accessible directory, similar to what Bioconda does. Such packages exist for sambamba, gemma, and the D compiler.

Finally, each combination of these packaging and workflow 
solutions occupies a slightly different region in the solution space 
for the scalability and reproducibility challenge. Fortunately, the 
packaging tools can be used next to each other without interfer- 
ence, thereby providing a wealth of software packages for bioinfor- 
matics. Today, there is hardly ever a good reason to build software 
from source. 


1. Using one of the packaging or container systems described 
(e.g., Conda, Guix, or Docker), prepare a working environ- 
ment to run the examples. Now try to run the workflows using 
the tools presented and appreciate the different approaches to 
execute the same example. 


2. Compare the different syntaxes used by the tools to define a 
workflow and explore how each tool describes the processes 
and the dependencies in a different way. 


3. Use the Amazon EC2 calculation sheet, and calculate how 
much it would cost to store 100 GB in S3, and execute a 
calculation on 100 “large” nodes, each reading 20 GB of 
data. Do the same for another cloud provider. 
Abstract


Open-source software encourages computer programmers to reuse software components written by others. 
In evolutionary bioinformatics, open-source software comes in a broad range of programming languages, 
including C/C++, Perl, Python, Ruby, Java, and R. To avoid writing the same functionality multiple times 
for different languages, it is possible to share components by bridging computer languages and Bio* 
projects, such as BioPerl, Biopython, BioRuby, BioJava, and R/Bioconductor. 

In this chapter, we compare the three principal approaches for sharing software between different 
programming languages: by remote procedure call (RPC), by sharing a local “call stack,” and by calling
from program to program. RPC provides a language-independent protocol over a network interface; examples
are SOAP and Rserve. The local call stack provides a between-language mapping, not over the network 
interface but directly in computer memory; examples are R bindings, RPy, and languages sharing the Java 
virtual machine stack. This functionality provides strategies for sharing of software between Bio* projects, 
which can be exploited more often. 

Here, we present cross-language examples for sequence translation and measure throughput of the 
different options. We compare calling into R through native R, RSOAP, Rserve, and RPy interfaces, with 
the performance of native BioPerl, Biopython, BioJava, and BioRuby implementations and with call stack 
bindings to BioJava and the European Molecular Biology Open Software Suite (EMBOSS). 

In general, call stack approaches outperform native Bio* implementations, and these, in turn, outper- 
form “RPC”-based approaches. To test and compare strategies, we provide a downloadable Docker 
container with all examples, tools, and libraries included. 


Key words Bioinformatics, R, Python, Ruby, Perl, Java, Web services, RPC, EMBOSS, PAML 


1 Introduction 


Bioinformatics has created its tower of Babel. The full set of func- 
tionality for bioinformatics, including statistical and computational 
methods for evolutionary biology, is implemented in a wide range 
of computer languages, e.g., Java, C/C++, Perl, Python, Ruby, and 
R. This comes as no surprise, as computer language design is the 
result of multiple trade-offs, for example, in strictness, convenience,
and performance. In this chapter we discuss strategies for combin- 
ing solutions from different languages and look at performance 
implications of combining cross-language functionality. In the pro- 
cess we also highlight implications of such strategic choices. 

Computer languages used in bioinformatics today typically fall 
into two groups: those compiled and those interpreted. Java, C++, 
and D, for example, are statically typed compiled languages, 
while R, Perl, Ruby, and Python are dynamically typed interpreted 
languages. In principle, a compiled language is converted into 
machine code once by a language compiler, and an interpreted 
language is compiled every time at runtime, the moment it is run 
by an interpreter. Static typing allows a compiler to optimize 
machine code for speed. Dynamic typing requires an interpreter 
and resolves variable and function types at runtime. Such design 
decisions cause Java, C++, and D to have stronger compile-time 
type checking and faster execution speed than R, Perl, Ruby, and 
Python. When comparing runtime performance of these languages, 
compiled statically typed languages, such as C++, D, and Java, 
generally outperform interpreted dynamically typed languages, 
such as Python, Perl, and R. For speed comparison between lan- 
guages, see, for example, the benchmarks game. 


Statically typed compiled languages tend to produce faster code at 
runtime 


Runtime performance, however, is not the only criterion for 
selecting a computer language. R, Perl, Ruby, and Python offer 
sophisticated interactive analysis of data in an interpreted shell 
which is not directly possible with C++, D, or Java. Another important criterion may be conciseness. Interpreted languages generally allow functionality to be written in fewer lines of code. The number of lines matters, as it is often easier to grasp something expressed in a short and concise fashion, if done competently, leading to easier coding and maintenance of software and resulting in increased programmer productivity. In general, with R, Perl, Ruby, and Python, it takes fewer lines of code to write software than with C++, D, or Java; this is also visible from the examples in the benchmarks game.


Interpreted languages allow for concise code that is easier to read 
and results in increased programmer productivity 


Based on the conciseness criterion, computer languages fall into these same two groups. This suggests a trade-off between execution speed and conciseness/programmer productivity. Even so, strong typing may help later when refactoring code, perhaps regaining some of that lost productivity. The authors also note that in their experience, the more programming languages one masters, the easier it becomes to master new languages (with the exception, perhaps, of Haskell). Learning new programming languages is important when writing software.

Logically, to fully utilize the potential of existing and future 
bioinformatics functionality, it is necessary to bridge between com- 
puter languages. Bioinformaticians cannot be expected to master 
every language, and it is inefficient to write the same functionality 
for every language. For example, R/Bioconductor contains unique 
and exhaustive functionalities for statistical methods, such as for 
gene expression analysis [1]. The singular implementation of this 
functionality in R has caused researchers to invest in learning the R 
language. Others, meanwhile, have worked on building bridges 
between languages. For example, RPy and Rserve allow accessing 
R functionality from Python [2], and JRI and Rserve allow acces- 
sing R functionality from Java [3, 4]. Other languages have similar 
bindings, such as RSRuby that allows accessing R from Ruby. 

Discussing other important criteria for selecting a program- 
ming language, such as ease of understanding, productivity, porta- 
bility, and the size and dynamics of the supporting Bio* project 
developer communities, is beyond the scope of this chapter. The 
authors, who have different individual preferences, wish to empha- 
size that every language has characteristics driven by language 
design and there is no single perfect all-purpose computer lan- 
guage. In practice, the choice of a computer language depends 
mainly on the individuals involved in a project, partly due to the 
investment it takes to master a language. Researchers and program- 
mers have prior investments and personal preferences, which have 
resulted in a wide range of computer languages used in the bioin- 
formatics community. 

Contrasting with singular implementations, every mainstream 
Bio* project, such as BioPerl [5], Biopython [6], BioRuby [7], 
R/Bioconductor [1], BioJava [8], the European Molecular Biology 
Open Software Suite (EMBOSS) [9], and Bio++ [10], contains 
duplication of functionality. Every Bio* project consists of a 
group of volunteers collaborating at providing functionality for 
bioinformatics, genomics, and life science research under an 
open-source software (OSS) license. The BioPerl project does 
that for Perl, BioJava for Java, etc. Next to the language used, the 
total coverage of functionality, and perhaps quality of implementa- 
tion, differs between projects. Not only is there duplication of 
effort, both in writing and testing code, but also there are differ- 
ences in implementation, completeness, correctness, and perfor- 
mance. For example, implementations between projects differ 
even for something as straightforward as codon translation, e.g., 
in number of types of encoding and support for the translating of 
ambiguous nucleotides. EMBOSS, uniquely, attempts to predict 
the final amino acid in a sequence, even when there are only two 
nucleotides available for the last codon. 
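
To illustrate how such differences arise, the following Python sketch translates with the standard genetic code and optionally predicts the final amino acid from a two-nucleotide tail when all four possible third bases give the same residue. It is a toy written for this comparison, not the code of EMBOSS or of any Bio* project; the function name, the predict_partial switch, and the use of X for unresolved codons are assumptions.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, predict_partial=False):
    """Translate a nucleotide sequence; unknown or ambiguous codons become X."""
    seq = seq.upper().replace("U", "T")
    full = len(seq) - len(seq) % 3  # length covered by complete codons
    protein = [CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, full, 3)]
    tail = seq[full:]
    if predict_partial and len(tail) == 2:
        # Try every third base; keep the residue only if all four agree.
        options = {CODON_TABLE.get(tail + b, "X") for b in BASES}
        protein.append(options.pop() if len(options) == 1 else "X")
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCC"))                       # MA
    print(translate("ATGGC"))                        # M
    print(translate("ATGGC", predict_partial=True))  # MA (GCN is always alanine)
    print(translate("ATGGA", predict_partial=True))  # MX (GAN is D or E)
```

Two implementations that differ only in the predict_partial choice already return proteins of different length for the same input, which is the kind of discrepancy described above.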
1.1 Bridging Functional Resources

Whereas Chapter 25 discusses Internet data resources and how to share them, in this chapter, we discuss how to share functional resources by interfacing and bridging functionality between different computer languages. This is highly relevant to evolutionary biology as most classic phylogenetic resources were written in C, while nowadays phylogenetic routines are written in Java, Perl, Python, Ruby, and R. Especially for communities with relatively few software developers, we argue here that it is important to bridge these functional resources from multiple languages. For bridging, we discuss strategies to invoke one program from another, to use some form of remote procedure call (RPC), or to use a local call stack.

Calling from Program to Program

The simplest way of interfacing software is by invoking one program from another. This strategy is often used in Bio* projects for invoking external programs. A regular subset would be PAML [11], HMMER [12], ClustalW [13], MAFFT [14], Muscle [15], BLAST [16], and MrBayes [17]. The Bio* projects typically contain modules which invoke the external program and parse the results. The advantage of this approach is that it mimics running a program on the command line, so invocation is straightforward. Another advantage, in a web service context, is that if the called program crashes, it does not have to take the whole service down. There are also some downsides, however. Loading a new instance of a program every time incurs extra overhead. More importantly, nonstandard input and output makes the interface fragile, i.e., what happens when input or output differs between two versions of a program? A further downside is that external programs do not offer fine-grained function access and have no support for advanced error handling and exceptions. What happens, for example, when the invoked program runs out of process memory? How can that be handled gracefully? A final complication is that such a program is an external software deployment dependency, which may be hard to resolve for an end user.

1.2 Remote Procedure Call

In contrast to calling one program from another, true cross- 
language interfacing allows one language to access functions 
and/or objects in another language, as if they are native function 
calls. To achieve transparent function calls between different com- 
puter languages, there are two principal approaches. The first 
approach is for one language to call directly into another language’s 
function or method over a network interface, the so-called remote 
procedure call (RPC). The second approach is to call into another 
language over a local “call stack.” 
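
What travels over the network in an RPC can be inspected with Python's standard xmlrpc.client module, which encodes a call into XML-RPC text and decodes it again without needing a server. The sketch below is only an illustration: the remote function name translate and its argument are hypothetical, and no real service is contacted.

```python
import xmlrpc.client

# Client side: encode the call translate("ATGGCC") as an XML-RPC request.
request = xmlrpc.client.dumps(("ATGGCC",), methodname="translate")

# Server side: decode the request back into parameters and a method name.
params, method = xmlrpc.client.loads(request)

# Server side: encode the (hypothetical) result as an XML-RPC response.
response = xmlrpc.client.dumps(("MA",), methodresponse=True)

# Client side: decode the response to recover the return value.
(result,), _ = xmlrpc.client.loads(response)

if __name__ == "__main__":
    print(request)  # XML text containing <methodName>translate</methodName>
    print(method, params, result)
```

Every call pays for this encoding and decoding on both ends, in addition to the network round trip, which is one reason RPC tends to be slower than a local function call.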

In bioinformatics, cross-language RPC comes in the form of web services and binary network protocols. A web service application programming interface (API) is exposed, and a function call gets translated with its parameters into a language-independent format, a procedure called “marshalling.” After calling the function on a server, the result is returned in, for example, XML and translated back through “unmarshalling.” Examples of cross-language XML protocols are SOAP [18] and XML/RPC [19].

More techniques exist for web service-type cross-language RPC. For example, representational state transfer (REST), or ReSTful [20], is a straightforward HTTP-based approach, often preferred over SOAP because of its simplicity. Another option is Resource Description Framework (RDF), commonly serialized as XML, as part of the semantic web specification. Both REST and RDF can be used for RPC solutions.

In addition, binary alternatives exist because XML-based protocols are not very efficient. XML is verbose, increasing the data load, and requires parsing at both marshalling and unmarshalling steps. In contrast, binary protocols are designed to reduce the data transfer load and increase speed. Examples of binary protocols are Rserve [3], which is specifically designed for R, and Google protocol buffers [21]. Another software framework based on a binary protocol is Thrift, by the Apache Software Foundation, designed for scalable cross-language service development [22]. Finally, also worth considering are very fast interoperable messaging-based paradigms, such as ZeroMQ [23], and high-level message-level optimizers, such as GraphQL.

1.3 Local Call Stack

The alternative to RPC is to create native local bindings from one 
language to another using a shared native call stack, essentially 
linking into code of a different computer language. With the call 
stack, function calls do not run over the network but over a stack 
implementation in shared computer memory. In a single virtual
machine, such as the JVM and the Erlang BEAM, compiled code can
share the same call stack, which can make cross-language calling
efficient. For example, the languages Java, Jython, JRuby, Clojure,
Groovy, and Scala can transparently call into each other at
native speed when running on the same virtual machine.

Native call stack sharing is also supported at the lowest level by 
the computer operating system through compiled shared libraries. 
These shared libraries have the extension .so on Linux, .dylib on
macOS, and .dll on Windows. The shared libraries are designed so
that they contain code and data that provide services to indepen- 
dent programs, which allows the sharing and changing of code and 
data in a modular fashion. Shared library interfaces are well defined 
at the operating system level, and languages have a way of binding 
them. Specialized interface bindings to shared libraries exist for
every language, for example, R’s C modules, the Java Native
Interface (JNI) for the JVM, Foreign Function Interfaces (FFI) for
Python and Ruby, and the Parrot native call interface and PerlXS
for Perl. 
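
To illustrate binding a compiled shared library over the local call stack, here is a minimal sketch using Python's standard ctypes module. It assumes a POSIX system (Linux or macOS) where the C runtime can be located; the choice of strlen is ours and serves only as a stand-in for a real bioinformatics routine:

```python
import ctypes
import ctypes.util

# Locate the C runtime. If no library name is found, passing None falls
# back to the symbols already loaded into the running process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

dna = b"atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt"
n_bases = libc.strlen(dna)      # this call executes compiled C code
print(n_bases, n_bases // 3)    # sequence length and number of complete codons
```

No source code is compiled and no network is involved: the Python bytes object is handed to the C function as a pointer, and the result comes back as a native integer.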

With (dynamic) shared libraries, certain algorithms can be
written in a low-level, high-performance compiled computer
language, such as C/C++, D, or FORTRAN. High-level languages,
such as Perl, Python, Ruby, R, and even Java, can then access
these algorithms. This way, languages can be mixed to optimize
solutions. Creating these shared library interfaces, however, can be
a tedious exercise, which often calls for code generators. One such
generator is the Simplified Wrapper and Interface Generator
(SWIG) [24], which consists of a macro-language, a C header file
parser, and the tools to bind low-level shared libraries to a wide
range of languages. For C/C++, SWIG can parse the header files
and generate the bindings for other languages, which, in turn, call
into these shared libraries. The Boost project has facilities
similar to SWIG for mapping calls. C FFIs that come with
programming languages, such as Python’s CFFI and Ruby’s FFI, tend
to be the easiest to work with.

Even though this extensive functionality for interfacing is
available, the potential of cross-language adapters is not fully
exploited in bioinformatics. Rather than bridge two languages,
researchers often opt to duplicate functionality. This is possibly
due to a lack of information on the effort involved and the added
complexity of creating a language bridge. Also, the impact on
performance may be an unknown quantity. A further complication
is the need to understand, to some degree, both sides of the
equation, i.e., to provide an R function to Python requires some
understanding of both R and Python, at least to the level of
reading the documentation of the shared module and creating a
working binding. Likewise, binding Python to C using a call stack
approach requires some understanding of both Python and C.
Sometimes, binding of complex functions can be daunting, and
deployment may be a concern, e.g., shared library bindings created
on Linux may not easily work on Windows or macOS.


1.4 Comparing Approaches

Here, we compare bridging code from one language to another 
using the RPC approach and the call stack approach. For
comparison, we also provide a program-to-program approach and show
how dependencies can be pinned to exact versions. The comparison
takes the form of short experiments (scripts) that the reader can
execute. To
measure performance between different approaches, we use codon 
translation as an example of shared functionality between Bio* 
projects. Codon translation is a straightforward algorithm with 
table lookups. Such sequence translation is representative of many 
bioinformatics tasks that deal with genome-sized data and require 
many function calls with small-sized parameters. 
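
For readers unfamiliar with the benchmark task, the following is a minimal sketch of codon translation by table lookup in plain Python. It is not the implementation used by any of the Bio* projects; it assumes the standard genetic code only, ignores a trailing partial codon, and maps unknown codons to X (our choices for illustration):

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon; a trailing partial codon is ignored."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Unknown codons (for example, containing N) become X.
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


print(translate("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt"))
# prints MSMVRNVSNQSEKLEIL
```

Each sequence costs one function call and a handful of dictionary lookups, which is why the overhead of the bridge itself, rather than the algorithm, tends to dominate the measurements that follow.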

In this chapter we first focus on comparing R and Python 
bindings. We include native Bio* implementations, i.e., Biopython, 
BioRuby, BioPerl, BioJava, and EMBOSS (C) for an absolute speed 
comparison. Next we try bindings on the JVM. 


2 Results


Examples and tests can in principle be run on a computer running
Linux, macOS, or Windows. To ease trials, we have defined GNU Guix
packages that contain the tools and their dependencies. From these
we have created a downloadable Docker image that supports all
interfaces and performance examples (GNU Guix and Docker are
discussed in Chapter 25).


2.1 Calling into R

R is a free and open-source environment for statistical computing 
and graphics [25]. R comes with a wide range of functionality, 
including modules for bioinformatics, such as bundled in R/Bio- 
conductor [1]. R is treated as a special citizen in this chapter 
because the language is widely used and comes with statistical 
algorithms for evolutionary biology, such as Ape [26] and SeqinR 
[27], both available through the comprehensive R archive network 
(CRAN). 

R defines a clear interface between the high-level language R 
and low-level highly optimized C and FORTRAN libraries, some of 
which have been around for a long time, such as the libraries for 
linear regression and linear algebra. In addition, the R environment 
successfully handles cross-platform packaging of C, C++, FOR- 
TRAN, and R code. The combination of features has resulted in 
R becoming the open-source language of choice in a number of 
communities, including statistics and some disciplines in biology. 
R/Bioconductor supports gene expression analysis [1], and R/qtl
[28] and R/qtlbim [29] support QTL mapping (see also QTL mapping
in Chapter 21). Not all is lost, however, for those not comfortable
with the R language itself. R can act as an intermediate between
this functionality and other high-level languages. A number of
libraries have been created that interface to R from other
languages, either providing a form of RPC, through RSOAP or Rserve,
or a call stack interface calling into the R shared library and
executing R commands, for example, RPy for Python, RSPerl for Perl,
RSRuby for Ruby, and JRI for Java. Of these call stack approaches,
RPy currently has the most complete implementation; see also [2].

In this chapter, we compare different approaches for invoking 
full R functionality from another language. To test cross-language 
calling, we elected to demonstrate codon translation.
Codon-to-protein amino acid translation is representative of a
relatively simple computation that potentially happens thousands of
times with
genome-sized data. Every Bio* project includes such a translation 
function, so it is a fair way to test for language interoperability and 
performance. For data, we use a WormBase [30] C. elegans cDNA 
FASTA file (33 Mb), containing 24,652 nucleotide sequences, 
predicted to translate to protein (Fig. 1).

Translate DNA sequences to protein
Fig. 1 Throughput of mRNA to protein translation using combinations of cross-language calling with a range of
programming resources. WormBase C. elegans predicted protein-coding DNA was parsed in FASTA format
and translated into amino acids. Tests were executed inside a container. Files of different sizes were used,
containing 500, 1000, 5000, 15,000, and 25,000 sequences (X-axis); the Y-axis shows the number of sequences
processed per second (log scale). Measurements were taken on an AMD Opteron(TM) 6128 8 cores at 2.0 GHz,
4 sockets x 8 cores, with 512 GB RAM DDR3 ECC, and an HDD SATA of 2 TB. Broadly, the figure shows that
sustained throughput is reached quickly and flattens out. R-Biostrings performs poorly at 285 Seq/s, while
R-GeneR and Rserve (Python+Rserve+GeneR) perform at the level of native Bio* libraries, at 658 Seq/s and
660 Seq/s, respectively. The cross-language Ruby-FFI at 6256 Seq/s calls EMBOSS C translation and
outperforms all others


2.1.1 Using GeneR with Plain R

The R/Bioconductor GeneR package [31] supports fast codon
translation with the strTranslate function implemented in C.
GeneR supports the eukaryotic code and other major encoding
standards. R usage is:


library(GeneR)
strTranslate("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt")

[1] "MSMVRNVSNQSEKLEIL"


The R+GeneR script (also available here) reads:

fasta = 'dna.fa'
library(GeneR)
idx = indexFasta(fasta)
lines <- readLines(paste(fasta, '.ix', sep=''))
index <- read.table(paste(fasta, '.ix', sep=''))[,1]
n = 0
# 'times' (the number of repetitions) is not defined in this excerpt
for (i in 1:times) {
  for (name in index) {
    readFasta(file=fasta, name=name)
    ntseq = getSeq(0)
    aaseq = strTranslate(ntseq)
    cat(">", name, " (", n, ")\n", aaseq, "\n", sep="")
    n = n+1
  }
}

The script parses the nucleotide FASTA input and outputs amino acid
FASTA. Run the script:


docker run --rm -v `pwd`/tmp:/tmp -v `pwd`/scripts:/scripts -e \
BATCH_VARS=/tmp/test-dna-${i}.fa -t bionode bash -c "source /etc/profile
cd /book-evolutionary-genomics
./scripts/create_test_files.rb
R -q --no-save --no-restore --no-readline --slave < src/R/DNAtranslate_GeneR.R" > /dev/null


Used directly from R, the throughput of the GeneR module is
about 658 sequences per second (Seq/s) on the test system, an
AMD Opteron(TM) 6128 CPU at 2.00 GHz (see also Fig. 1).
For the first edition, we checked the implementation by reading the
source code and found that the GeneR FASTA parser was a
huge bottleneck. The FASTA parser implementation created an
index on disk and reloaded the full index file from disk for each
individual sequence, thereby incurring a large overhead for every
single sequence.

To see if we could improve throughput, we replaced the slow
FASTA parser with R+Biostrings, which reads FASTA once
into RAM using the R/Bioconductor Biostrings module and still
uses GeneR to translate. At the time, this implementation was 1.6
times faster than GeneR. In the current tests, GeneR is on average
3.2 times faster than reading with Biostrings, which had a
throughput of 284.83 Seq/s, suggesting that the GeneR authors have
since improved the parser. The second script can be found here.


2.1.2 Calling into R from Other Languages with RPC

One strategy for bridging between languages is to use R as a
network server and invoke remote procedure calls (RPC) over the
network.


1. SOAP
SOAP allows processes to communicate using XML over
HTTP in a client/server setup. SOAP is operating system
and computer language “agnostic,” so it can be used to bridge
between languages. In the previous edition of this chapter,
we wrote an R/SOAP [32] adapter for codon translation and
invoked it from Python (a Python to R bridge). That client
script can be found here. The SOAP bridge was dropped from
this chapter because the SOAP packages are not maintained and
it was by far the slowest method of cross-language interfacing
we tried. The marshalling and unmarshalling of simple string
objects using XML over a local network interface takes a lot of
computational resources. We do not recommend using SOAP.


2. Rserve
Rserve [3] is a custom binary network protocol, more efficient
than XML-based protocols [3]. R data types are converted into
Rserve binary data types. Rserve was originally written for Java,
but nowadays connectors exist for other languages. With
Rserve, Python and R do not have to run on the same server.
Furthermore, all data structures will automatically be
converted from native R to native Python and numpy types
and back.


With Rserve running, a Python example is:


import pyRserve
conn = pyRserve.connect()
conn.eval('library(GeneR)')
conn.eval('strTranslate("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt")')
'MSMVRNVSNQSEKLEIL'


where Rserve+GeneR uses the GeneR translate function. In
our test Biopython [6] is used for parsing FASTA, and at
797 Seq/s, even with this network bridge, Python+Rserve’s
speed is on par with that of R. The script can be found here.


2.1.3 Calling into R from Other Languages with the Call Stack Approach

Another strategy for bridging languages is to use a native call
stack, i.e., data does not get transferred over the network. RPy2
executes R code from within Python over a local call stack [2].
Invoking the same GeneR functions from Python:


import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
importr('GeneR')
strTranslate = robjects.r['strTranslate']
strTranslate("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt")[0]
'MSMVRNVSNQSEKLEIL'


This example uses Biopython for parsing FASTA and invokes
GeneR translation over a call stack handled by RPy2. At 2049 Seq/s,
throughput is the highest of our calling into R examples. The
Python implementation outperforms the other FASTA parsers, and
GeneR is fast too when only the translation function is called
(GeneR’s strTranslate is actually written in C, not in R). Still,
there are some overheads for bridging and transforming string
objects from Python into R and back. The RPy2 call stack approach
is efficient for passing data back and forth. The script can be found
here.


2.2 Native Bio* Implementations

When dealing with cross-language transport comparisons, it is
interesting to compare results with native language
implementations. For example, Biopython [6] would be:


# Bio.Alphabet was removed in Biopython 1.78; with newer releases,
# create the Seq without the generic_dna argument.
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
coding_dna = Seq("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt", generic_dna)
coding_dna.translate()

Seq('MSMVRNVSNQSEKLEIL', ExtendedIUPACProtein())


which runs at 797 Seq/s, which is slower than the Python3+RPy2
+GeneR version. This is because the translate function is written in
Python and not in C. It is, however, still faster than R+GeneR.
Ruby+BioRuby runs faster at 1481 Seq/s. Perl+BioPerl is in the
middle with 1165 Seq/s. We can assume the Biopython, BioPerl,
and BioRuby implementations are reasonably optimized for
performance. Therefore, throughput reflects the performance of
these interpreted languages (see Fig. 1).

Java is a statically typed compiled language. Java+BioJava [8]
outperforms the interpreters and runs at 2266 Seq/s.

The source code for all examples can be found here in the
Biopython, BioRuby, BioPerl, and BioJava subdirectories.


2.3 Using the JVM for Cross-Language Support

The Java virtual machine (JVM) is a “bytecode” standard that
represents a form of computer intermediate language. This
language conceptually represents the instruction set of a
stack-oriented capability architecture. This intermediate language,
or “bytecode,” is not tied to Java specifically, and in the last
10 years, a number of languages have appeared which target the JVM,
including JRuby (Ruby on the JVM), Jython (Python on the
JVM), Groovy [33], Clojure [34], and Scala [35]. These languages
also compile into bytecode and share the same JVM stack. The
shared JVM stack allows transparent function calling between
different languages.

An example of calling BioJava translation from a Scala program:


import org.biojava.nbio.core.sequence.transcription.TranscriptionEngine
import org.biojava.nbio.core.sequence._
val transcriber = TranscriptionEngine.getDefault()
val dna = new DNASequence("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt")
val rna = dna.getRNASequence(transcriber)
rna.getProteinSequence(transcriber)

'MSMVRNVSNQSEKLEIL'


which uses the BioJava libraries.

A native Java function, such as getProteinSequence, is directly
invoked from the other language without overheads (the passed-in
transcriber object is passed by reference, just like in Java). In fact,
Scala compiles to bytecode, which maps one to one to Java,
including the class definitions. The produced bytecode is native
Java bytecode; therefore, the performance of calling BioJava from
Scala or Java is exactly the same. This also holds for other
languages on the JVM, such as Clojure and Groovy.

We have also included a JRuby example that calls into BioJava4
on the JVM and runs at 1413 Seq/s. JRuby is an interpreter on the
JVM that still needs some translation when calling into JVM
functions. It is therefore slower than native calls.


2.4 Shared C Library Cross-Calling Using EMBOSS Codon Translation

EMBOSS is a free and open-source software analysis package
specially developed for the needs of the molecular biology user
community, mostly written in C [9].


2.4.1 FFI

Using a Foreign Function Interface (FFI), it is possible to load
dynamic libraries at runtime, define classes to map composite data
types, and bind functions for later use inside your host
programming language. We used FFI to bind the EMBOSS translation
function to Python and Ruby. The Python example:


# Note: ctypes assumes int arguments and return values by default; on
# 64-bit platforms it is safer to declare pointers via argtypes/restype.
from ctypes import *
import os
emboss = cdll.LoadLibrary(os.path.join(os.path.dirname(os.path.abspath(__file__)), "emboss.so"))
trnTable = emboss.ajTrnNewI(1)
ajpseq = emboss.ajSeqNewNameC(b"atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt", b"Test sequence")
ajpseqt = emboss.ajTrnSeqOrig(trnTable, ajpseq, 1)
seq = emboss.ajSeqGetSeqCopyC(ajpseqt)
seq = str(c_char_p(seq).value, 'utf-8')
print(seq)

MSMVRNVSNQSEKLEILX


The Ruby example:

require 'ffi'
module Emboss
  extend FFI::Library
  ffi_lib "./emboss.so"
  attach_function :ajTrnNewI, [:int], :pointer
  attach_function :ajSeqNewNameC, [:pointer, :pointer], :pointer
  attach_function :ajTrnSeqOrig, [:pointer, :pointer, :int], :pointer
  attach_function :ajSeqGetSeqCopyC, [:pointer], :string
end

trnTable = Emboss.ajTrnNewI(1)
ajpseq = Emboss.ajSeqNewNameC("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt", "Test sequence")
ajpseqt = Emboss.ajTrnSeqOrig(trnTable, ajpseq, 1)
aa = Emboss.ajSeqGetSeqCopyC(ajpseqt)
print aa, "\n"

MSMVRNVSNQSEKLEILX


In both cases the advantage of FFI is that it does not require
compiling any source code, just loading the shared library and
binding what is needed. Python has a native library called ctypes,
and more sophisticated libraries are available to help the
programmer bind complex data structures and functions. Ruby has a
dedicated gem called ruby-ffi.

The Ruby and Python FFI outperform all above methods at
6257 Seq/s and 4787 Seq/s, respectively (see Fig. 1). Plotting the
time in seconds spent to translate the sequences, Ruby and Python
FFI are the lowest (quickest) in the whole comparison (see Fig. 2).
The high speed indicates that (1) the invoked Biopython and
BioRuby functions are reasonably efficient at parsing FASTA,
(2) the FFI-generated call stack is efficient for moving data over
the local call stack, and (3) the EMBOSS transeq DNA to protein
translation is efficient C code.


2.5 Calling Program to Program

Calling program to program is far more common than you may
think because even when you run a program in a shell, such as Bash,
you are calling program to program. You can invoke EMBOSS
from the command line:


transeq test-dna.fa test.pep


transeq is written in C and runs at a very fast 23,478 Seq/s.
Invoking EMBOSS’ transeq from Python looks like this:


import os
from Bio import SeqIO

# fn holds the path of the input FASTA file
os.system("transeq "+fn+" out.pep")
for seq_record in SeqIO.parse("out.pep", "fasta"):
    print(">", seq_record.id)
    seq = str(seq_record.seq)
    print(seq)


and this combination runs at 4768 Seq/s. That is close to Python
FFI and about a fifth of the speed of transeq on its own because of
Python parsing the output. Every parsing step has a cost attached.

Translate DNA sequences to protein (lower is better)
Fig. 2 Number of seconds needed for processing mRNA to protein translation using cross-language calling
with a range of programming resources. See Fig. 1 for the setup. The figure shows that for all
implementations, the time increases linearly with the number of input sequences. R-Biostrings performs
poorly, with a startup time of 6.50 s and the highest slope. The cross-language Ruby-FFI, Python FFI, and
Python-EMBOSS, with a startup time slightly higher than Java, have a very small slope; Ruby-FFI has a nearly
constant time


2.6 Web Services

A discussion on bridging languages would not be complete if we
did not include web services, particularly those using REST APIs.
Services like TogoWS and the EBI web services, which include
EMBOSS transeq (SOAP), offer functionality over HTTP(S) and can be
used from any programming language. Here is a Ruby example of
using TogoWS:


## Invoke irb by loading BioRuby
% irb -r bio

## Create a TogoWS object
>> togows = Bio::TogoWS::REST.new
=> #<Bio::TogoWS::REST:0x007f840faab9d8 @pathbase="/",
@http=#<Net::HTTP togows.dbcls.jp:80 open=false>,
@header={"User-Agent"=>"BioRuby/1.5.1"}, @debug=false>

## Search for UniProt entries by keywords
>> togows.search('uniprot', 'lung cancer')
=> "KKLC1_MACFA\nKKLC1_HUMAN\nDLEC1_HUMAN\n .....

## Retrieve one UniProt entry (or multiple entries if you like)
>> entry = togows.entry('uniprot', 'KKLC1_MACFA')

## See the entry content
>> puts entry
ID KKLC1_MACFA Reviewed; 114 AA.
AC Q4R717;

## Convert the retrieved UniProt entry into FASTA format
>> puts togows.convert(entry, 'uniprot', 'fasta')
>KKLC1_MACFA RecName: Full=Kita-kyushu lung cancer antigen 
1 homolog; 
MNVYLLLASGILCALMTVFWKYRRFORNTGEMSSNSTALALVRPSSTGLINSNTDNNLSV 
YDLSRDILNNF PHS TAMOKRILVNLTTVENKLVELEHTLVSKGFRSASAHRKST 


Web services can harness a lot of power because they use large 
databases and access up-to-date information. As an example, let’s 
generate RDF from above entry: 


## Retrieve PubMed entry and convert it into RDF/Turtle
## (or JSON or XML if you like)
>> puts togows.entry('pubmed', '16381885', 'ttl')
@prefix dc: <http://purl.org/dc/elements/1.1/>
@prefix dcterms: <http://purl.org/dc/terms/>
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
@prefix prism: <http://prismstandard.org/namespaces/2.0/basic/>
@prefix medline: <http://purl.jp/bio/10/pubmed/>

<http://rdf.ncbi.nlm.nih.gov/pubmed/16381885> medline:pmid "16381885"
rdfs:label "pmid:16381885"
dc:identifier "16381885"
medline:own "NLM" ;


Unfortunately, data-centric web services can be slow, i.e.,
sending and retrieving data over the internet incurs large latency
and throughput penalties. Sometimes they use powerful back ends, and
it is possible to submit large batch jobs which compete with locally
installed solutions. Examples are the BLAST service [16] and
GeneNetwork [36].


3 Discussion

The half-life of bioinformatics software is 2 years—Pjotr Prins 


In this chapter we show that there are many ways of bridging 
between computer languages. Cross-language interfacing is a 
topic of importance to evolutionary genomics (and beyond) 
because computational biologists need to provide tools that are 
capable of complex analysis and cope with the amount of biological 
data generated by the latest technologies. Cross-language interfac- 
ing allows sharing of code. This means computer software can be 
written in the computer language of choice for a particular purpose. 
Flexibility in choice of computer programming language allows 
optimizing of computational resources and, perhaps even more 
important, software developer resources, in bioinformatics. 

When some functionality is needed that exists in a different 
computer language than the one used for a project, a developer has 
the following options: either rewrite the code in the preferred 
language, essentially a duplication of effort, or bridge from one 
language to the other. For bridging, there are essentially two tech- 
nical methods that allow full programmatic access to functionality: 
through RPC or a local call stack. A third option may be available 
when functionality can be reached through the command line, as 
shown above with transeq. 

RPC function invocation, over a network interface, has the 
advantage of being language agnostic and even machine indepen- 
dent. A function can run on a different machine or even over the 
Internet, which is the basis of web services and may be attractive 
even for running services locally. RPC XML-based technologies, 
however, are slow because of expensive parsing and high data load. 
Our metrics suggest that it may be worth experimenting with 
binary protocols, such as Rserve and Apache Thrift. 
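
The difference in data load can be illustrated with a minimal sketch using only the Python standard library. XML-RPC marshalling is provided by xmlrpc.client; the binary framing shown here (a one-byte method identifier and a four-byte length, followed by the raw bytes) is a hypothetical layout of our own and is not the wire format of Rserve, Thrift, or protocol buffers:

```python
import struct
import xmlrpc.client

dna = "atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt"

# Marshalling to XML: the method name and the parameter are wrapped in tags.
xml_text = xmlrpc.client.dumps((dna,), methodname="translate")
xml_request = xml_text.encode("utf-8")

# Unmarshalling on the receiving side requires parsing the XML again.
params, method = xmlrpc.client.loads(xml_text)

# Hypothetical binary framing: method id (1 byte), payload length (4 bytes,
# big-endian), then the sequence itself as raw ASCII bytes.
TRANSLATE_ID = 1
payload = dna.encode("ascii")
binary_request = struct.pack(">BI", TRANSLATE_ID, len(payload)) + payload

print(method, params[0] == dna)
print(len(xml_request), "bytes as XML;", len(binary_request), "bytes as binary")
```

For a short sequence such as this one, the XML envelope is larger than the data it carries, and it has to be parsed on both ends of the call.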

When performance is critical, e.g., when much data needs to be 
processed, or functions are invoked millions of times, a native call 
stack approach may be preferred over RPC. Metrics suggest that the 
EMBOSS C implementation performs well and that binding to the
native C libraries with FFI is efficient (see Fig. 2). Alternatively, it is
possible to use R as an intermediate to C libraries. Interestingly,
calling R libraries, many of which are written in C, may give higher
performance than calling into native Bio* implementations. For
example, Python+RPy2+GeneR is faster than the pure Python
Biopython implementation of sequence translation, and it is also
faster than R calling into GeneR directly, confirming a common
complaint that R can be slow.

Even though RPC may perform less well than local stack-based 
approaches, RPC has some real advantages. For example, if you
have a choice between calling a local BLAST library and calling
into a remote, ready-to-use NCBI RPC interface, the latter avoids
the deployment complexity. Also, the public resource may be more up
to date than a copy running on a local server. This holds for many
curated services
that involve large databases, such as PDB [37], Pfam [38], KEGG 
[39], and UniProt [40]. Chapter 25 gives a deeper treatment of 
these Internet resources. 

From the examples given in this chapter, it may be clear that 
actual invocation of functions through the different technologies is 
similar, i.e., all listed Python scripts look similar, provided the 
underlying dependencies on tools and libraries have been resolved. 
The main difference between implementations is with deployment 
of software, rather than invocation of functionality. The JVM 
approach is of interest, because it makes bridging between sup- 
ported languages transparent and deployment straightforward. Not 
only can languages be mixed, but also the advanced Java tool chain 
is available, including debuggers, profilers, load distributors, and 
build tools. Other shared virtual machines, such as .NET and 
Parrot, potentially offer similar advantages but are less used in 
bioinformatics. 

In the first edition, we wrote that when striving for reliable and 
correct software solutions, the alternative strategy of calling com- 
puter programs as external units via the command line should be 
discouraged: not only is it less efficient that a program gets started 
every time a function gets called, but also a potential deployment 
nightmare is introduced. What happens when the program is not 
installed, or the interface changed between versions, or when there 
is some other error? With the full programmatic interfaces, dis- 
cussed in this chapter, incompatibilities between functions get 
caught much earlier. In this edition of the chapter, we add that 
efficiency considerations still hold, and error handling can be
problematic. When it comes to deployment, however, there now exist
solutions that pin software versions and give control of the
dependency graph, i.e., a tool like transeq can be coupled to your
software at an exact version. Two approaches achieve this coupling.
First, containers, such as offered by Docker, allow for
bundling software binaries. Second, some recent software
distributions offer formal deployment solutions with reproducible
dependency graphs; see the GNU Guix and NixOS projects. It is
possible to combine these deployment technologies. In fact, with
this chapter, we provide tools and scripts defined as GNU Guix
packages and hosted in a Docker container. These solutions are
discussed in Chapter 25.

Choosing a computer language should not be based on run- 
time performance considerations alone. The maturity of the lan- 
guage and accompanying libraries, tools, and documentation 
should count heavily, as well as the activity of the community 
involved. The time saved by using a known language versus 
learning a new language should be factored in. The main point 
we are trying to make here is that it is possible to mix languages 
using different interfacing strategies. This allows leveraging existing 
functionality, as written by others, using a language of choice. 
Depending on one’s needs, it is advisable to test possible alterna- 
tives for performance, as the different tests show that performance 
varies. 
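
As a starting point for such tests, the sketch below measures throughput in sequences per second for any callable. It is a simplified stand-in for the measurement scripts provided with this chapter, not a copy of them; the function name throughput, the number of repeats, and the dummy workload are assumptions made for illustration:

```python
import time


def throughput(func, sequences, repeats=3):
    """Return the best observed rate of func over sequences, in sequences per second."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for seq in sequences:
            func(seq)
        best = min(best, time.perf_counter() - start)
    return len(sequences) / best if best > 0 else float("inf")


if __name__ == "__main__":
    # Stand-in workload: reversing a string takes the place of a translation call.
    test_set = ["atgtcaatggtaagaaatgtatcaaat" * 40] * 5000
    rate = throughput(lambda s: s[::-1], test_set)
    print(f"{rate:.0f} Seq/s")
```

Replacing the stand-in with a bridged call and with its native counterpart gives a first impression of the cost of the bridge on a given system.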

Whichever language and bridging technology is preferred, we
think it important to test the performance of different ways of
interfacing languages, as (1) there is a need for combining
languages in bioinformatics and (2) it is not always clear what
impact a choice of cross-language interface may have on
performance. By testing different bridging technologies and
functional implementations, the best solution should emerge for a
specific scenario.

So far, we have focused on the performance of cross-language 
calling. In Chapter 25, scalability of computation is discussed by 
programming for multiple processors and machines. 


1. Install the Docker container and run different tests. Can you
replicate the differences in throughput statistics?


2. Why are network protocols such as Rserve slower than native 
call stack approaches? 


3. What are possible advantages of using a virtual machine, such as 
the JVM? 


4. If you were to bridge between your favorite language and an R 
library, what options do you have? 


We thank all open-source software developers for creating such
great tools and libraries for the scientific community.